Planning in Bets: Risk Mitigation at Scale

What do you do when you have a finite amount of time to deal with an infinite number of things that can go wrong? This post breaks down a high-level risk mitigation process into four questions that can be applied to nearly any scenario to help you make the best use of the time and resources available to you.

Using Server Sent Events to Simplify Real-time Streaming at Scale

When building any kind of real-time data application, trying to figure out how to send messages from the server to the client (or vice versa) is a big part of the equation. Over the years, various communication models have popped up to handle server-to-client communication, including Server Sent Events (SSE). 

SSE is a unidirectional server push technology that enables a web client to receive automatic updates from a server via an HTTP connection. With SSE, data delivery is quick and simple because there's no periodic polling, so there's no need to temporarily stage data.

This was a perfect addition to a real-time data visualization product Shopify ships every year—our Black Friday Cyber Monday (BFCM) Live Map. 

Our 2021 Live Map system was complex and used a polling communication model that wasn't well suited to our needs. While the system had 100 percent uptime, it wasn't without its bottlenecks. We knew we could improve performance and data latency.

Below, we'll walk through how we implemented an SSE server to simplify our BFCM Live Map architecture and improve data latency. We'll discuss choosing the right communication model for your use case, the benefits of SSE, and how to implement a scalable SSE server in Golang that's load-balanced with NGINX.

Choosing a Real-time Communication Model

First, let’s discuss choosing how to send messages. When it comes to real-time data streaming, there are three communication models:

  1. Push: This is the most real-time model. The client opens a connection to the server and that connection remains open. The server pushes messages and the client waits for those messages. The server manages a registry of connected clients to push data to. The scalability is directly related to the scalability of this registry.
  2. Polling: The client makes a request to the server and gets a response immediately, whether there's a message or not. This model can waste bandwidth and resources when there are no new messages. While this model is the easiest to implement, it doesn’t scale well. 
  3. Long polling: This is a combination of the two models above. The client makes a request to the server, but the connection is kept open until a response with data is returned. Once a response with new data is returned, the connection is closed. 

No model is better than another; it really depends on the use case.

Our use case is the Shopify BFCM Live Map, a web user interface that processes and visualizes real-time sales made by millions of Shopify merchants over the BFCM weekend. The data we’re visualizing includes:

  • Total sales per minute 
  • Total number of orders per minute 
  • Total carbon offset per minute 
  • Total shipping distance per minute 
  • Total number of unique shoppers per minute 
  • A list of latest shipping orders
  • Trending products

Shopify's 2022 BFCM Live Map frontend

BFCM is the biggest data moment of the year for Shopify, so streaming real-time data to the Live Map is a complicated feat. Our platform is handling millions of orders from our merchants. To put that scale into perspective, during BFCM 2021 we saw 323 billion rows of data ingested by our ingestion service. 

For the BFCM Live Map to be successful, it requires a scalable and reliable pipeline that provides accurate, real-time data in seconds. A crucial part of that pipeline is our server-to-client communication model. We need something that can handle both the volume of data being delivered, and the load of thousands of people concurrently connecting to the server. And it needs to do all of this quickly.

Our 2021 BFCM Live Map delivered data to a presentation layer via WebSocket. The presentation layer then deposited data in a mailbox system for the web client to periodically poll, taking (at minimum) 10 seconds. In practice, this worked but the data had to travel a long path of components to be delivered to the client.

Data was provided by a multi-component backend system consisting of a Golang-based application (Cricket) using a Redis server and a MySQL database. The Live Map's data pipeline was a multi-region, multi-job Apache Flink-based application. Flink processed source data from Apache Kafka topics, along with enrichment data stored as Parquet files in Google Cloud Storage (GCS), and produced the results into other Kafka topics for Cricket to consume.

Shopify's 2021 BFCM globe backend architecture

While this got the job done, the complex architecture caused bottlenecks in performance. In the case of our trending products data visualization, it could take minutes for changes to become available to the client. We needed to simplify in order to improve our data latency. 

As we approached this simplification, we knew we wanted to deprecate Cricket and replace it with a Flink-based data pipeline. We’ve been investing in Flink over the past couple of years, and even built our streaming platform on top of it—we call it Trickle. We knew we could leverage these existing engineering capabilities and infrastructure to streamline our pipeline. 

With our data pipeline figured out, we needed to decide how to deliver the data to the client. We took a look at how we were using WebSocket and realized it wasn't the best tool for our use case.

Server Sent Events Versus WebSocket

WebSocket provides a bidirectional communication channel over a single TCP connection. This is great to use if you’re building something like a chat app, because both the client and the server can send and receive messages across the channel. But, for our use case, we didn’t need a bidirectional communication channel. 

The BFCM Live Map is a data visualization product so we only need the server to deliver data to the client. If we continued to use WebSocket it wouldn’t be the most streamlined solution. SSE on the other hand is a better fit for our use case. If we went with SSE, we’d be able to implement:

  • A secure uni-directional push: The connection stream is coming from the server and is read-only.
  • A connection that uses plain, familiar HTTP requests: This is a benefit for us because we were already using HTTP, so we wouldn't need to implement a special, esoteric protocol.
  • Automatic reconnection: If there's a loss of connection, reconnection is automatically retried after a certain amount of time.

But most importantly, SSE would allow us to remove the process of retrieving, processing, and storing data on the presentation layer for the purpose of client polling. With SSE, we would be able to push the data as soon as it becomes available. There would be no more polls and reads, so no more delay. This, paired with a new streamlined pipeline, would simplify our architecture, scale with peak BFCM volumes and improve our data latency. 

With this in mind, we decided to implement SSE as our communication model for our 2022 Live Map. Here’s how we did it.

Implementing SSE in Golang

We implemented an SSE server in Golang that subscribes to Kafka topics and pushes the data to all registered clients’ SSE connections as soon as it’s available. 

Shopify's 2022 BFCM Live Map backend architecture with SSE server

A real-time streaming Flink data pipeline processes raw Shopify merchant sales data from Kafka topics. It also processes periodically-updated product classification enrichment data on GCS in the form of compressed Apache Parquet files. These are then computed into our sales and trending product data respectively and published into Kafka topics.

Here’s a code snippet of how the server registers an SSE connection:
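
A minimal sketch of such a handler in Go might look like the following; the type names, fields, and /events route are illustrative assumptions, and the Kafka consumer that pushes messages into each client's channel is omitted.

package main

import (
    "fmt"
    "net/http"
    "sync"
)

type SSEServer struct {
    mu      sync.Mutex
    clients map[chan []byte]struct{} // registry of connected clients
}

func (s *SSEServer) ServeHTTP(w http.ResponseWriter, r *http.Request) {
    flusher, ok := w.(http.Flusher)
    if !ok {
        http.Error(w, "streaming unsupported", http.StatusInternalServerError)
        return
    }

    // Headers that mark this response as an SSE stream.
    w.Header().Set("Content-Type", "text/event-stream")
    w.Header().Set("Cache-Control", "no-cache")
    w.Header().Set("Connection", "keep-alive")

    // Register this connection so the Kafka consumer (not shown) can push to it.
    events := make(chan []byte, 16)
    s.mu.Lock()
    s.clients[events] = struct{}{}
    s.mu.Unlock()

    // Unregister the connection when the client goes away.
    defer func() {
        s.mu.Lock()
        delete(s.clients, events)
        s.mu.Unlock()
    }()

    for {
        select {
        case <-r.Context().Done():
            return
        case msg := <-events:
            fmt.Fprintf(w, "data: %s\n\n", msg) // SSE wire format
            flusher.Flush()
        }
    }
}

func main() {
    server := &SSEServer{clients: make(map[chan []byte]struct{})}
    http.Handle("/events", server)
    http.ListenAndServe(":8080", nil)
}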

Subscribing to the SSE endpoint is simple with the EventSource interface. Typically, client code creates a native EventSource object and registers an event listener on the object. The event is available in the callback function:
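
A minimal illustration (the endpoint URL is a placeholder):

const eventSource = new EventSource("https://sse.example.com/events");

eventSource.addEventListener("message", (event) => {
  // event.data is the raw string payload; the Live Map payloads are JSON.
  const payload = JSON.parse(event.data);
  console.log(payload);
});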

When it came to integrating the SSE server with our frontend UI, the UI application was expected to subscribe to an authenticated SSE server endpoint to receive data. The data pushed from the server to the client is publicly accessible during BFCM, but the authentication enables us to control access when the site is no longer public. Pre-generated JWT tokens are provided to the client by the server that hosts it, and are used for the subscription. We used the open source EventSourcePolyfill implementation to pass an authorization header with the request:
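
A sketch of that subscription, assuming the event-source-polyfill npm package (which accepts a headers option) and a jwtToken value already available to the page:

import { EventSourcePolyfill } from "event-source-polyfill";

// jwtToken is assumed to have been injected by the server hosting the client.
const eventSource = new EventSourcePolyfill("https://sse.example.com/events", {
  headers: {
    Authorization: `Bearer ${jwtToken}`,
  },
});

eventSource.addEventListener("message", (event) => {
  const payload = JSON.parse(event.data);
  console.log(payload);
});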

Once subscribed, data is pushed to the client as it becomes available. Data is consistent with the SSE format, with the payload being a JSON parsable by the client.

Ensuring SSE Can Handle Load 

Our 2021 system struggled under a large number of requests from user sessions at peak BFCM volume due to the message bus bottleneck. We needed to ensure our SSE server could handle our expected 2022 volume. 

With this in mind, we built our SSE server to be horizontally scalable, with a cluster of VMs sitting behind Shopify's NGINX load balancers. As the load increases or decreases, we can elastically expand and reduce our cluster size by adding or removing pods. However, it was essential that we determined the limit of each pod so that we could plan our cluster accordingly.

One of the challenges of operating an SSE server is determining how the server will operate under load and handle concurrent connections. Connections to the client are maintained by the server so that it knows which ones are active, and thus which ones to push data to. This SSE connection is implemented by the browser, including the retry logic. It wouldn’t be practical to open tens of thousands of true browser SSE connections. So, we need to simulate a high volume of connections in a load test to determine how many concurrent users one single server pod can handle. By doing this, we can identify how to scale out the cluster appropriately.

We opted to build a simple Java client that can initiate a configurable number of SSE connections to the server. This Java application is bundled into a runnable JAR that can be distributed to multiple VMs in different regions to simulate the expected number of connections. We leveraged the open source okhttp-eventsource library to implement this Java client.

Here’s the main code for this Java client:
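
The sketch below illustrates the general idea, assuming the EventHandler-based builder API from okhttp-eventsource 2.x (newer major versions changed this API); the URL and argument handling are placeholders.

import com.launchdarkly.eventsource.EventHandler;
import com.launchdarkly.eventsource.EventSource;
import com.launchdarkly.eventsource.MessageEvent;

import java.net.URI;

public class SseLoadTest {
    // Minimal handler that just receives messages; a real test might record latencies.
    static class CountingHandler implements EventHandler {
        public void onOpen() {}
        public void onClosed() {}
        public void onMessage(String event, MessageEvent messageEvent) {
            // messageEvent.getData() holds the JSON payload pushed by the server.
        }
        public void onComment(String comment) {}
        public void onError(Throwable t) {
            System.err.println("connection error: " + t);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        int connections = args.length > 0 ? Integer.parseInt(args[0]) : 1000;
        String url = args.length > 1 ? args[1] : "https://sse.example.com/events";

        for (int i = 0; i < connections; i++) {
            EventSource source = new EventSource.Builder(new CountingHandler(), URI.create(url)).build();
            source.start(); // opens the SSE connection on a background thread
        }

        Thread.sleep(Long.MAX_VALUE); // keep the connections open for the duration of the test
    }
}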

Did SSE Perform Under Pressure?

With another successful BFCM in the bag, we can confidently say that implementing SSE in our new streamlined pipeline was the right move. Our BFCM Live Map saw 100 percent uptime. As for data latency, with SSE data was delivered to clients within milliseconds of its availability, a big improvement over the minimum 10-second poll of our 2021 system. Overall, including the data processing in our Flink data pipeline, data was visualized on the BFCM Live Map UI within 21 seconds of its creation time.

We hope you enjoyed this behind the scenes look at the 2022 BFCM Live Map and learned some tips and tricks along the way. Remember, when it comes to choosing a communication model for your real-time data product, keep it simple and use the tool best suited for your use case.

Bao is a Senior Staff Data Engineer who works on the Core Optimize Data team. He's interested in large-scale software system architecture and development, big data technologies and building robust, high performance data pipelines.

Our platform handled record-breaking sales over BFCM and commerce isn't slowing down. Want to help us scale and make commerce better for everyone? Join our team.

How to Export Datadog Metrics for Exploration in Jupyter Notebooks

"Is there a way to extract Datadog metrics in Python for in-depth analysis?" 

This question has been coming up a lot at Shopify recently, so I thought detailing a step-by-step guide might be useful for anyone going down this same rabbit hole.

Follow along below to learn how to extract data from Datadog and build your analysis locally in Jupyter Notebooks.

Why Extract Data from Datadog?

As a quick refresher, Datadog is a monitoring and security platform for cloud applications, used to find issues in your platform, monitor the status of different services, and track the health of an infrastructure in general. 

So, why would you ever need Datadog metrics to be extracted?

There are two main reasons why someone may prefer to extract the data locally rather than using Datadog:

  1. Limitation of analysis: Datadog has a limited set of visualizations that can be built and it doesn't have the tooling to perform more complex analysis (e.g. building statistical models). 
  2. Granularity of data: Datadog dashboards have a fixed width for the visualizations, which means that checking metrics across a larger time frame makes the metric data less granular. For example, the below image shows a Datadog dashboard capturing a 15-minute span of activity, which generates metrics at a 1-second interval:
Datadog dashboard showing data over the past 15 minutes

Comparatively, the below image shows a Datadog dashboard that captures a 30-day span of activity, which generates metrics at a 2-hour interval:

Datadog dashboard showing data over the past 30 days

As you can see, Datadog visualizes an aggregated trend over the 2-hour window, which means it smooths out (and hides) any interesting events. For these reasons, someone may prefer to extract the data manually from Datadog and run their own analysis.

How to Extract Data and Build Your Own Analysis

For the purposes of this blog, we’ll be running our analysis in Jupyter notebooks. However, feel free to use your own preferred tool for working with Python.

Datadog has a REST API which we’ll use to extract data from.

In order to extract data from Datadog's API, all you need are two things:

  1. API credentials: You'll need credentials (an API key and an APP key) to interact with the Datadog API.
  2. Metric query: You need a query to execute in Datadog. For the purposes of this blog, let’s say we wanted to track the CPU utilization over time.

Once you have the above two requirements sorted, you’re ready to dive into the data.

Step 1: Initiate the required libraries and set up your credentials for making the API calls:
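
A minimal version using the official datadog Python package might look like this, assuming the keys are stored in environment variables:

import os

from datadog import api, initialize

# API and APP keys are assumed to live in environment variables.
options = {
    "api_key": os.environ["DD_API_KEY"],
    "app_key": os.environ["DD_APP_KEY"],
}
initialize(**options)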

 

Step 2: Specify the parameters for time-series data extraction. Below we’re setting the time period from Tuesday, November 22, 2022 at 16:11:49 GMT to Friday, November 25, 2022 at 16:11:49 GMT:
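
A sketch of those parameters; the metric name and bucket width are illustrative choices:

from datetime import datetime, timezone

# Metric query to execute (an assumed example metric).
query = "avg:system.cpu.user{*}"

# Tuesday, November 22, 2022 16:11:49 GMT to Friday, November 25, 2022 16:11:49 GMT.
start = int(datetime(2022, 11, 22, 16, 11, 49, tzinfo=timezone.utc).timestamp())
end = int(datetime(2022, 11, 25, 16, 11, 49, tzinfo=timezone.utc).timestamp())

# Width of each extraction bucket, in seconds; larger values mean fewer API calls.
time_delta = 3600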

One thing to keep in mind is that Datadog rate-limits API requests. If you run into rate limiting issues, try increasing the time_delta in the query above to reduce the number of requests you make to the Datadog API.

Step 3: Run the extraction logic. Take the start and the stop timestamps and split them into buckets of width = time_delta.

An example of taking the start and the stop timestamps and splitting them into buckets of width = time_delta

Next, make calls to the Datadog API for the above bucketed time windows in a for loop. For each call, append the data you extracted for bucketed time frames to a list.

Lastly, convert the lists to a dataframe and return it:
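
A sketch of that extraction logic, assuming the datadog package's api.Metric.query interface and its series/pointlist response shape, plus the query, start, end, and time_delta values defined earlier:

import pandas as pd

from datadog import api


def extract_metric(query, start, end, time_delta):
    """Split [start, end] into buckets of width time_delta and query each bucket."""
    timestamps, values = [], []

    bucket_start = start
    while bucket_start < end:
        bucket_end = min(bucket_start + time_delta, end)

        # One API call per bucket keeps each response small and granular.
        response = api.Metric.query(start=bucket_start, end=bucket_end, query=query)

        for series in response.get("series", []):
            for point_ts, point_value in series.get("pointlist", []):
                timestamps.append(pd.to_datetime(point_ts, unit="ms"))
                values.append(point_value)

        bucket_start = bucket_end

    return pd.DataFrame({"timestamp": timestamps, "cpu_usage": values})


df = extract_metric(query, start, end, time_delta)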

 

Step 4: Voila, you have the data! As the mock data table below shows, the extracted data has more granularity than what is shown in Datadog.

Example of the granularity of data exported from Datadog

Now, we can use this to visualize data using any tool we want. For example, let’s use seaborn to look at the distribution of the system’s CPU utilization using KDE plots:
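
For example, assuming the dataframe built above has a cpu_usage column:

import matplotlib.pyplot as plt
import seaborn as sns

# KDE plot of the CPU utilization values extracted above.
sns.kdeplot(data=df, x="cpu_usage", fill=True)
plt.title("Distribution of system CPU utilization")
plt.show()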

 

As you can see below, this visualization provides a deeper insight.

Visualizing the data we pulled from Datadog in seaborn to look at the distribution using KDE plots

And there you have it. A super simple way to extract data from Datadog for exploration in Jupyter notebooks.

Kunal is a data scientist on the Shopify ProdEng data science team, working out of Niagara Falls, Canada. His team helps make Shopify’s platform performant, resilient and secure. In his spare time, Kunal enjoys reading about tech stacks, working on IoT devices and spending time with his family.

Are you passionate about solving data problems and eager to learn more about Shopify? Check out openings on our careers page.

Caching Without Marshal Part 2: The Path to MessagePack

In part one of Caching Without Marshal, we dove into the internals of Marshal, Ruby’s built-in binary serialization format. Marshal is the black box that Rails uses under the hood to transform almost any object into binary data and back. Caching, in particular, depends heavily on Marshal: Rails uses it to cache pretty much everything, be it actions, pages, partials, or anything else.

Marshal’s magic is convenient, but it comes with risks. Part one presented a deep dive into some of the little-documented internals of Marshal with the goal of ultimately replacing it with a more robust cache format. In particular, we wanted a cache format that would not blow up when we shipped code changes.

Part two is all about MessagePack, the format that did this for us. It’s a binary serialization format, and in this sense it’s similar to Marshal. Its key difference is that whereas Marshal is a Ruby-specific format, MessagePack is generic by default. There are MessagePack libraries for Java, Python, and many other languages.

You may not know MessagePack, but if you’re using Rails chances are you’ve got it in your Gemfile because it’s a dependency of Bootsnap.

The MessagePack Format

On the surface, MessagePack is similar to Marshal: just replace .dump with .pack and .load with .unpack. For many payloads, the two are interchangeable.

Here’s an example of using MessagePack to encode and decode a hash:
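
A minimal roundtrip with the msgpack gem looks like this:

require "msgpack"

payload = MessagePack.pack({ "name" => "BFCM", "year" => 2022 })
# => a compact binary string

MessagePack.unpack(payload)
# => {"name"=>"BFCM", "year"=>2022}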

MessagePack supports a set of core types that are similar to those of Marshal: nil, integers, booleans, floats, and a type called raw, covering strings and binary data. It also has composite types for array and map (that is, a hash).

Notice, however, that the Ruby-specific types that Marshal supports, like Object and instance variable, aren’t in that list. This isn’t surprising since MessagePack is a generic format and not a Ruby format. But for us, this is a big advantage since it’s exactly the encoding of Ruby-specific types that caused our original problems (recall the beta flag class names in cache payloads from Part One).

Let's take a closer look at the encoded data from Marshal and MessagePack. Suppose we encode the string "foo" with Marshal; this is what we get:

Encoded data from Marshal for Marshal.dump("foo"): 0408 4922 0866 6f6f 063a 0645 54

Let's look at the payload: 0408 4922 0866 6f6f 063a 0645 54. We see that the payload "foo" is encoded in hex as 666f6f and prefixed by 08, representing a length of 3 (f-o-o). Marshal wraps this string payload in a TYPE_IVAR, which as mentioned in part 1 is used to attach instance variables to types that aren't strictly implemented as objects, like strings. In this case, the instance variable (3a 0645) is named :E. This is a special instance variable used by Ruby to represent the string's encoding; its value here is T (54) for true, meaning this is a UTF-8 encoded string. So Marshal uses a Ruby-native idea to encode the string's encoding.

In MessagePack, the payload (a366 6f6f) is much shorter:

Encoded data from MessagePack for MessagePack.pack("foo"): a366 6f6f

The first thing you'll notice is that there isn't an encoding. MessagePack's default encoding is UTF-8, so there's no need to include it in the payload. Also note that the payload type (10100011), String, is encoded together with its length: the bits 101 encode a string of up to 31 bytes, and 00011 says the actual length is 3 bytes. Altogether this makes for a very compact encoding of a string.

Extension Types

After deciding to give MessagePack a try, we did a search for Rails.cache.write and Rails.cache.read in the codebase of our core monolith, to figure out roughly what was going into the cache. We found a bunch of stuff that wasn’t among the types MessagePack supported out of the box.

Luckily for us, MessagePack has a killer feature that came in handy: extension types. Extension types are custom types that you can define by calling register_type on an instance of MessagePack::Factory, like this:
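
The sketch below shows the general shape of that API, using Range as an illustrative class and an arbitrary type code; it assumes a recent msgpack-ruby version where the factory exposes dump and load.

require "msgpack"

factory = MessagePack::Factory.new

# Symbol support ships with msgpack-ruby; it just needs a type code.
factory.register_type(0x00, Symbol)

# A custom type with explicit packer/unpacker procs.
factory.register_type(
  0x01,
  Range,
  packer: ->(range) { [range.begin, range.end, range.exclude_end?].to_msgpack },
  unpacker: ->(data) { bounds = MessagePack.unpack(data); Range.new(bounds[0], bounds[1], bounds[2]) }
)

packed = factory.dump(1..10)
factory.load(packed) # => 1..10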

An extension type is made up of the type code (a number from 0 to 127—there’s a maximum of 128 extension types), the class of the type, and a serializer and deserializer, referred to as packer and unpacker. Note that the type is also applied to subclasses of the type’s class. Now, this is usually what you want, but it’s something to be aware of and can come back to bite you if you’re not careful.

Here’s the Date extension type, the simplest of the extension types we use in the core monolith in production:
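
Reconstructed from the description that follows, the type looks roughly like this (an approximation, not the exact code from the monolith):

require "date"
require "msgpack"

factory = MessagePack::Factory.new

factory.register_type(
  3,
  Date,
  packer: ->(date) { [date.year, date.month, date.day].pack("s< C C") },
  unpacker: ->(payload) {
    year, month, day = payload.unpack("s< C C")
    Date.new(year, month, day)
  }
)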

As you can see, the code for this type is 3, and its class is Date. Its packer takes a date and extracts the date's year, month, and day. It then packs them with the Array#pack method using the format string "s< C C", which converts the year to a 16-bit signed integer and the month and day to 8-bit unsigned integers. The type's unpacker goes the other way: it takes a string and, using the same format string, extracts the year, month, and day with String#unpack, then passes them to Date.new to create a new date object.

Here’s how we would encode an actual date with this factory:
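
For example, with a date chosen to match the hex shown below:

date = Date.new(2022, 9, 9)

payload = factory.dump(date)
payload.unpack1("H*") # => "d603e6070909"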

Converting the result to hex, we get d603 e607 0909, which corresponds to the date (e607 0909) prefixed by the extension type (d603):

Encoded date from the factory

As you can see, the encoded date is compact. Extension types give us the flexibility to encode pretty much anything we might want to put into the cache in a format that suits our needs.

Just Say No

If this were the end of the story, though, we wouldn't really have had enough to go with MessagePack in our cache. Remember our original problem: we had a payload containing objects whose classes changed, breaking on deploy when they were loaded into old code that didn't have those classes defined. To prevent that problem from happening again, we need to stop those classes from going into the cache in the first place.

We need MessagePack, in other words, to refuse to encode any object that doesn't have a defined type, and to let us catch these objects so we can follow up. Luckily for us, MessagePack does this. It's not the kind of "killer feature" that's advertised as such, but it's enough for our needs.

Take this example, where factory is the factory we created previously:
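
Something along these lines, a hash containing an array that contains a plain Ruby Object (the exact payload here is illustrative):

payload = { result: [1, 2, Object.new] }

factory.dump(payload)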

If MessagePack were to happily encode this—without any Object type defined—we’d have a problem. But as mentioned earlier, MessagePack doesn’t know Ruby objects by default and has no way to encode them unless you give it one.

So what actually happens when you try this? You get an error like this:

NoMethodError: undefined method `to_msgpack' for <#Object:0x...>

Notice that MessagePack traversed the entire object, through the hash, into the array, until it hit the Object instance. At that point, it found something for which it had no type defined and basically blew up.

The way it blew up is perhaps not ideal, but it’s enough. We can rescue this exception, check the message, figure out it came from MessagePack, and respond appropriately. Critically, the exception contains a reference to the object that failed to encode. That’s information we can log and use to later decide if we need a new extension type, or if we are perhaps putting things into the cache that we shouldn’t be.

The Migration

Now that we’ve looked at Marshal and MessagePack, we’re ready to explain how we actually made the switch from one to the other.

Making the Switch

Our migration wasn't instantaneous. We ran the two side by side for a period of about six months while we figured out what was going into the cache and which extension types we needed. The path of the migration, however, was actually quite simple. Here's the basic step-by-step process:

  1. First, we created a MessagePack factory with our extension types defined on it and used it to encode the mystery object passed to the cache (the puzzle piece in the diagram below).
  2. If MessagePack was able to encode it, great! We added a version byte prefix that we used to track which extension types were defined for the payload, and then we put the prefixed payload into the cache.
  3. If, on the other hand, the object failed to encode, we rescued the NoMethodError which, as mentioned earlier, MessagePack raises in this situation. We then fell back to Marshal and put the Marshal-encoded payload into the cache. Note that when decoding, we were able to tell which payloads were Marshal-encoded by their prefix: if it’s 0408 it’s a Marshal-encoded payload, otherwise it’s MessagePack.
The migration's three-step process

The step where we rescued the NoMethodError was quite important in this process since it was where we were able to log data on what was actually going into the cache. Here’s that rescue code (which of course no longer exists now since we’re fully migrated to MessagePack):
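
A simplified sketch of that fallback path; the version-prefix constant, logging fields, and metric name are illustrative, and it assumes a StatsD client such as statsd-instrument.

def serialize_entry(entry)
  MESSAGEPACK_VERSION_PREFIX + @factory.dump(entry)
rescue NoMethodError => error
  raise unless error.message.include?("to_msgpack")

  # Log the class that couldn't be encoded so we can decide whether it needs an
  # extension type, or shouldn't be going into the cache at all.
  Rails.logger.info(
    message: "Cache entry not serializable with MessagePack",
    class_name: error.receiver.class.name,
  )
  StatsD.increment("cache.marshal_fallback", tags: ["class:#{error.receiver.class.name}"])

  Marshal.dump(entry)
end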

As you can see, we sent data (including the class of the object that failed to encode) to both logs and StatsD. These logs were crucial in flagging the need for new extension types, and also in signaling to us when there were things going into the cache that shouldn’t ever have been there in the first place.

We started the migration process with a small set of default extension types which Jean Boussier, who worked with me on the cache project, had registered in our core monolith earlier for other work using MessagePack. There were five:

  • Symbol (supported out of the box by the msgpack-ruby gem; it just has to be enabled)
  • Time
  • DateTime
  • Date (shown earlier)
  • BigDecimal

These were enough to get us started, but they certainly weren't enough to cover the full variety of things going into the cache. In particular, being a Rails application, the core monolith caches a lot of records, so we needed a way to serialize them. We needed an extension type for ActiveRecord::Base.

Encoding Records

Records are defined by their attributes (roughly, the values in their table columns), so it might seem like you could just cache them by caching their attributes. And you can.

But there’s a problem: records have associations. Marshal encodes the full set of associations along with the cached record. This ensures that when the record is deserialized, the loaded associations (those that have already been fetched from the database) will be ready to go without any extra queries. An extension type that only caches attribute values, on the other hand, needs to make a new query to refetch those associations after coming out of the cache, making it much more inefficient.

So we needed to cache loaded associations along with the record’s attributes. We did this with a serializer called ActiveRecordCoder. Here’s how it works. Consider a simple post model that has many comments, where each comment belongs to a post with an inverse defined:
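
The models are ordinary Active Record classes along these lines:

class Post < ApplicationRecord
  has_many :comments, inverse_of: :post
end

class Comment < ApplicationRecord
  belongs_to :post, inverse_of: :comments
end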

Note that the Comment model here has an inverse association back to itself via its post association. Recall that Marshal handles this kind of circularity automatically using the link type (the @ symbol) we saw in part 1, but MessagePack doesn't handle circularity by default. We'll have to implement something like a link type to make this encoder work.

Instance Tracker handles circularity

The trick we use for handling circularity involves something called an Instance Tracker. It tracks records encountered while traversing the record’s network of associations. The encoding algorithm builds a tree where each association is represented by its name (for example :comments or :post), and each record is represented by its unique index in the tracker. If we encounter an untracked record, we recursively traverse its network of associations, and if we’ve seen the record before, we simply encode it using its index.

This algorithm generates a very compact representation of a record’s associations. Combined with the records in the tracker, each encoded by its set of attributes, it provides a very concise representation of a record and its loaded associations.

Here’s what this representation looks like for the post with two comments shown earlier:

Once ActiveRecordCoder has generated this array of arrays, we can simply pass the result to MessagePack to encode it to a bytestring payload. For the post with two comments, this generates a payload of around 300 bytes. Considering that the Marshal payload for the post with no associations we looked at in Part 1 was 1,600 bytes in length, that’s not bad.

But what happens if we try to encode this post with its two comments using Marshal? The result is shown below: a payload over 4,000 bytes long. So the combination of ActiveRecordCoder with MessagePack is 13 times more space efficient than Marshal for this payload. That’s a pretty massive improvement.

ActiveRecordCoder + MessagePack vs Marshal

In fact, the space efficiency of the switch to MessagePack was so significant that we immediately saw the change in our data analytics. As you can see in the graph below, our Rails cache memcached fill percent dropped after the switch. Keep in mind that for many payloads, for example boolean and integer valued-payloads, the change to MessagePack only made a small difference in terms of space efficiency. Nonetheless, the change for more complex objects like records was so significant that total cache usage dropped by over 25 percent.

Rails cache Memcached fill percent versus time, showing the drop after the switch to MessagePack

Handling Change

You might have noticed that ActiveRecordCoder, our encoder for ActiveRecord::Base objects, includes the name of record classes and association names in encoded payloads. Although our coder doesn’t encode all instance variables in the payload, the fact that it hardcodes class names at all should be a red flag. Isn’t this exactly what got us into the mess caching objects with Marshal in the first place?

And indeed, it is—but there are two key differences here.

First, since we control the encoding process, we can decide how and where to raise exceptions when class or association names change. So when decoding, if we find that a class or association name isn’t defined, we rescue the error and re-raise a more specific error. This is very different from what happens with Marshal.

Second, since this is a cache, and not, say, a persistent datastore like a database, we can afford to occasionally drop a cached payload if we know that it’s become stale. So this is precisely what we do. When we see one of the exceptions for missing class or association names, we rescue the exception and simply treat the cache fetch as a miss. Here’s what that code looks like:
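
A simplified sketch of the idea, with the method and error names as assumptions rather than the real ones:

def deserialize_entry(payload)
  ActiveRecordCoder.load(payload)
rescue ActiveRecordCoder::ClassMissingError, ActiveRecordCoder::AssociationMissingError
  nil # treated as a cache miss, so the entry is regenerated
end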

The result of this strategy is effectively that during a deploy where class or association names change, cache payloads containing those names are invalidated, and the cache needs to replace them. This can effectively disable the cache for those keys during the period of the deploy, but once the new code has been fully released the cache again works as normal. This is a reasonable tradeoff, and a much more graceful way to handle code changes than what happens with Marshal.

Core Type Subclasses

With our migration plan and our encoder for ActiveRecord::Base, we were ready to embark on the first step of the migration to MessagePack. As we were preparing to ship the change, however, we noticed something was wrong on continuous integration (CI): some tests were failing on hash-valued cache payloads.

A closer inspection revealed a problem with HashWithIndifferentAccess, a subclass of Hash provided by ActiveSupport that makes symbols and strings work interchangeably as hash keys. Marshal handles subclasses of core types like this out of the box, so you can be sure that a HashWithIndifferentAccess that goes into a Marshal-backed cache will come back out as a HashWithIndifferentAccess and not a plain old Hash. The same cannot be said for MessagePack, unfortunately, as you can confirm yourself:
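
Using the factory from earlier (and its dump and load methods), the roundtrip looks something like this:

require "active_support"
require "active_support/hash_with_indifferent_access"

hash = ActiveSupport::HashWithIndifferentAccess.new(foo: "bar")

factory.load(factory.dump(hash)).class
# => Hash (the subclass is silently dropped)

Marshal.load(Marshal.dump(hash)).class
# => ActiveSupport::HashWithIndifferentAccess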

MessagePack doesn’t blow up here on the missing type because  HashWithIndifferentAccess is a subclass of another type that it does support, namely Hash. This is a case where MessagePack’s default handling of subclasses can and will bite you; it would be better for us if this did blow up, so we could fall back to Marshal. We were lucky that our tests caught the issue before this ever went out to production.

The problem was a tricky one to solve, though. You would think that defining an extension type for HashWithIndifferentAccess would resolve the issue, but it didn’t. In fact, MessagePack completely ignored the type and continued to serialize these payloads as hashes.

As it turns out, the issue was with msgpack-ruby itself. The code handling extension types didn’t trigger on subclasses of core types like Hash, so any extensions of those types had no effect. I made a pull request (PR) to fix the issue, and as of version 1.4.3, msgpack-ruby now supports extension types for Hash as well as Array, String, and Regex.

The Long Tail of Types

With the fix for HashWithIndifferentAccess, we were ready to ship the first step in our migration to MessagePack in the cache. When we did this, we were pleased to see that MessagePack was successfully serializing 95 percent of payloads right off the bat without any issues. This was validation that our migration strategy and extension types were working.

Of course, it’s the last 5 percent that’s always the hardest, and indeed we faced a long tail of failing cache writes to resolve. We added types for commonly cached classes like ActiveSupport::TimeWithZone and Set, and edged closer to 100 percent, but we couldn’t quite get all the way there. There were just too many different things still being cached with Marshal.

At this point, we had to adjust our strategy. It wasn’t feasible to just let any developer define new extension types for whatever they needed to cache. Shopify has thousands of developers, and we would quickly hit MessagePack’s limit of 128 extension types.

Instead, we adopted a different strategy that helped us scale indefinitely to any number of types. We defined a catchall type for Object, the parent class for the vast majority of objects in Ruby. The Object extension type looks for two methods on any object: an instance method named as_pack and a class method named from_pack. If both are present, it considers the object packable, and uses as_pack as its serializer and from_pack as its deserializer. Here’s an example of a Task class that our encoder treats as packable:
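
A sketch of the as_pack/from_pack contract, with illustrative attributes:

class Task
  attr_reader :id, :title

  def initialize(id:, title:)
    @id = id
    @title = title
  end

  # Called by the Object extension type's packer.
  def as_pack
    [id, title]
  end

  # Called by the Object extension type's unpacker.
  def self.from_pack(packed)
    id, title = packed
    new(id: id, title: title)
  end
end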

Note that, as with the ActiveRecord::Base extension type, this approach relies on encoding class names. As mentioned earlier, we can do this safely since we handle class name changes gracefully as cache misses. This wouldn’t be a viable approach for a persistent store.

The packable extension type worked great, but as we worked on migrating existing cache objects, we found many that followed a similar pattern, caching either Structs or T::Structs (Sorbet’s typed struct). Structs are simple objects defined by a set of attributes, so the packable methods were each very similar since they simply worked from a list of the object’s attributes. To make things easier, we extracted this logic into a module that, when included in a struct class, automatically makes the struct packable. Here’s the module for Struct:
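
An illustrative sketch of the idea, including the attribute-name digest discussed next; the digest calculation and error class are assumptions, and the real module now lives in the Shopify/paquito gem:

require "zlib"

StaleStructError = Class.new(StandardError)

module PackableStruct
  def self.included(base)
    base.extend(ClassMethods)
  end

  # Serialize a digest of the member names followed by the member values.
  def as_pack
    [self.class.members_digest, *to_a]
  end

  module ClassMethods
    # Digest of the struct's member names, used to detect renamed attributes.
    def members_digest
      Zlib.crc32(members.join(",")) % 100_000
    end

    def from_pack(packed)
      digest, *values = packed
      # A changed digest means the struct was refactored; treat the data as stale.
      raise StaleStructError unless digest == members_digest

      new(*values)
    end
  end
end

Point = Struct.new(:x, :y) { include PackableStruct }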

The serialized data for the struct instance includes an extra digest value (26450) that captures the names of the struct’s attributes. We use this digest to signal to the Object extension type deserialization code that attribute names have changed (for example in a code refactor). If the digest changes, the cache treats cached data as stale and regenerates it:

Simply by including this module (or a similar one for T::Struct classes), developers can cache struct data in a way that’s robust to future changes. As with our handling of class name changes, this approach works because we can afford to throw away cache data that has become stale.

The struct modules accelerated the pace of our work, enabling us to quickly migrate the last objects in the long tail of cached types. Having confirmed from our logs that we were no longer serializing any payloads with Marshal, we took the final step of removing it entirely from the cache. We’re now caching exclusively with MessagePack.

Safe by Default

With MessagePack as our serialization format, the cache in our core monolith became safe by default. Not safe most of the time or safe under some special conditions, but safe, period. It's hard to overstate the importance of a change like this to the stability and scalability of a platform as large and complex as Shopify's.

For developers, having a safe cache brings a peace of mind that one less unexpected thing will happen when they ship their refactors. This makes such refactors—particularly large, challenging ones—more likely to happen, improving the overall quality and long-term maintainability of our codebase.

If this sounds like something that you’d like to try yourself, you’re in luck! Most of the work we put into this project has been extracted into a gem called Shopify/paquito. A migration process like this will never be easy, but Paquito incorporates the learnings of our own experience. We hope it will help you on your journey to a safer cache.

Chris Salzberg is a Staff Developer on the Ruby and Rails Infra team at Shopify. He is based in Hakodate in the north of Japan.


Wherever you are, your next journey starts here! If building systems from the ground up to solve real-world problems interests you, our Engineering blog has stories about other challenges we have encountered. Intrigued? Visit our Engineering career page to find out about our open positions and learn about Digital by Design.

Caching Without Marshal Part 1: Marshal from the Inside Out

Caching is critical to how Rails applications work. At every layer, whether it be in page rendering, database querying, or external data retrieval, the cache is what ensures that no single bottleneck brings down an entire application. 

But caching has a dirty secret, and that secret’s name is Marshal.

Marshal is Ruby’s ultimate sharp knife, able to transform almost any object into a binary blob and back. This makes it a natural match for the diverse needs of a cache, particularly the cache of a complex web framework like Rails. From actions, to pages, to partials, to queries—you name it, if Rails is touching it, Marshal is probably caching it. 

Marshal’s magic, however, comes with risks.

A couple of years ago, these risks became very real for us. It started innocently enough. A developer at Shopify, in an attempt to clean up some code in our core monolith, shipped a PR refactoring some key classes around beta flags. The refactor got the thumbs up in review and passed all tests and other checks.

As it went out to production, though, it became clear something was very wrong. A flood of exceptions triggered an incident, and the refactor was quickly rolled back and reverted. We were lucky to escape so easily.

The incident was a wake-up call for us. Nothing in our set of continuous integration (CI) checks had flagged the change. Indeed, even in retrospect, there was nothing wrong with the code change at all. The issue wasn’t the code, but the fact that the code had changed.

The problem, of course, was Marshal. Being so widely used, beta flags were being cached. Marshal serializes an object’s class along with its other data, so many of the classes that were part of the refactor were also hardcoded in entries of the cache. When the newly deployed code began inserting beta flag instances with the new classes into the cache, the old code—which was still running as the deploy was proceeding—began choking on class names and methods that it had never seen before.

As a member of Shopify's Ruby and Rails Infrastructure team, I was involved in the follow-up for this incident. The incident was troubling to us because there were really only two ways to mitigate the risk of the same incident happening again, and neither was acceptable. The first is simply to put fewer things into the cache, or a smaller variety of things; this decreases the likelihood of cached objects conflicting with future code changes. But this defeats the purpose of having a cache in the first place.

The other way to mitigate the risk is to change code less, because it’s code changes that ultimately trigger cache collisions. But this was even less acceptable: our team is all about making code cleaner, and that requires changes. Asking developers to stop refactoring their code goes against everything we were trying to do at Shopify.

So we decided to take a deeper look and fix the root problem: Marshal. We reasoned that if we could use a different serialization format—one that wouldn’t cache any arbitrary object the way Marshal does, one that we could control and extend—then maybe we could make the cache safe by default.

The format that did this for us is MessagePack. MessagePack is a binary serialization format that’s much more compact than Marshal, with stricter typing and less magic. In this two-part series (based on a RailsConf talk by the same name), I’ll pry Marshal open to show how it works, delve into how we replaced it, and describe the specific challenges posed by Shopify’s scale.

But to start, let’s talk about caching and how Marshal fits into that.

You Can’t Always Cache What You Want

Caching in Rails is easy. Out of the box, Rails provides caching features that cover the common requirements of a typical web application. The Rails Guides provide details on how these features work, and how to use them to speed up your Rails application. So far, so good.

What you won’t find in the guides is information on what you can and can’t put into the cache. The low-level caching section of the caching guide simply states: “Rails’ caching mechanism works great for storing any kind of information.” (original emphasis) If that sounds too good to be true, that’s because it is.

Under the hood, all types of cache in Rails are backed by a common interface of two methods, read and write, on the cache instance returned by Rails.cache. While there are a variety of cache backends—in our core monolith we use Memcached, but you can also cache to file, memory, or Redis, for example—they all serialize and deserialize data the same way, by calling Marshal.load and Marshal.dump on the cached object.

Cache encoding format in Rails 6 and Rails 7

If you actually take a peek at what these cache backends put into the cache, you might find that things have changed in Rails 7 for the better. This is thanks to work by Jean Boussier, who’s also in the Ruby and Rails Infrastructure team at Shopify, and who I worked with on the cache project. Jean recently improved cache space allocation by more efficiently serializing a wrapper class named ActiveSupport::Cache::Entry. The result is a more space-efficient cache that stores cached objects and their metadata without any redundant wrapper.

Unfortunately, that work doesn’t help us when it comes to the dangers of Marshal as a serialization format: while the cache is slightly more space efficient, all those issues still exist in Rails 7. To fix the problems with Marshal, we need to replace it.

Let’s Talk About Marshal

But before we can replace Marshal, we need to understand it. And unfortunately, there aren’t a lot of good resources explaining what Marshal actually does.

To figure that out, let’s start with a simple Post record, which we will assume has a title column in the database:
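
Something like this, assuming the posts table has a title column:

class Post < ApplicationRecord
  # The posts table is assumed to have a title string column.
end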

We can create an instance of this record and pass it to Marshal.dump:
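
For example (the title matches the one that shows up in the payload described below):

post = Post.new(title: "Caching Without Marshal")
payload = Marshal.dump(post)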

This is what we get back:

This is a string of around 1,600 bytes, and as you can see, a lot is going on in there. There are constants corresponding to various Rails classes like ActiveRecord, ActiveModel and ActiveSupport. There are also instance variables, which you can identify by the @ symbol before their names. And finally there are many values, including the name of the post, Caching Without Marshal, which appears three times in the payload.

The magic of Marshal, of course, is that if we take this mysterious bytestring and pass it to Marshal.load, we get back exactly the Post record we started with.

You can do this a day from now, a week from now, a year from now, whenever you want—you will get the exact same object back. This is what makes Marshal so powerful.

And this is all possible because Marshal encodes the universe. It recursively crawls objects and their references, extracts all the information it needs, and dumps the result to the payload.

But what is actually going on in that payload? To figure that out, we’ll need to dig deeper and go to the ultimate source of truth in Ruby: the C source code. Marshal’s code lives in a file called marshal.c. At the top of the file, you’ll find a bunch of constants that correspond to the types Marshal uses when encoding data.

Marshal types defined in marshal.c

At the top of that list are MARSHAL_MAJOR and MARSHAL_MINOR, the major and minor versions of Marshal, not to be confused with the version of Ruby. This is what comes first in any Marshal payload. The Marshal version hasn't changed in years and can pretty much be treated as a constant.

Next in the file are several types I will refer to here as "atomic", meaning types that can't contain other objects inside themselves. These are the things you probably expect: nil, true, false, numbers, floats, symbols, and also classes and modules.

Next, there are types I will refer to as “composite” that can contain other objects inside them. Most of these are unsurprising: array, hash, struct, and object, for example. But this group also includes two you might not expect: string and regex. We’ll return to this later in this article.

Finally, there are several types toward the end of the list whose meaning is probably not very obvious at all. We will return to these later as well.

Objects

Let’s first start with the most basic type of thing that Marshal serializes: objects. Marshal encodes objects using a type called TYPE_OBJECT, represented by a small character o.

Marshal-encoded bytestring for the example post

Here’s the Marshal-encoded bytestring for the example Post we saw earlier, converted to make it a bit easier to parse.

The first thing we can see in the payload is the Marshal version (0408), followed by an object, represented by an ‘o’ (6f). Then comes the name of the object’s class, represented as a symbol: a colon (3a) followed by the symbol’s length (09) and name as an ASCII string (Post). (Small numbers are stored by Marshal in an optimized format—09 translates to a length of 4.) Then there’s an integer representing the number of instance variables, followed by the instance variables themselves as pairs of names and values.

You can see that a payload like this, with each variable itself containing an object with further instance variables of its own, can get very big, very fast.

Instance Variables

As mentioned earlier, Marshal encodes instance variables in objects as part of its object type. But it also encodes instance variables in other things that, although seemingly object-like (subclassing the Object class), aren’t in fact implemented as such. There are four of these, which I will refer to as core types, in this article: String, Regex, Array, and Hash. Since Ruby implements these types in a special, optimized way, Marshal has to encode them in a special way as well.

Consider what happens if you assign an instance variable to a string, like this:
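
For example:

str = "foo"
str.instance_variable_set(:@bar, "baz")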

This may not be something you do every day, but it’s something you can do. And you may ask: does Marshal handle this correctly?

The answer is: yes it does.

It does this using a special type called TYPE_IVAR to encode instance variables on things that aren’t strictly implemented as objects, represented by a variable name and its value. TYPE_IVAR wraps the original type (String in this case), adding a list of instance variable names and values. It’s also used to encode instance variables in hashes, arrays, and regexes in the same way.

Circularity

Another interesting problem is circularity: what happens when an object contains references to itself. Records, for example, can have associations that have inverses pointing back to the original record. How does Marshal handle this?

Take a minimal example: an array which contains a single element, the array itself:
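
That is:

arr = []
arr << arr # the array now contains itself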

What happens if we run this through Marshal? Does it segmentation fault on the self-reference? 

As it turns out, it doesn’t. You can confirm yourself by passing the array through Marshal.dump and Marshal.load:
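
For example:

copy = Marshal.load(Marshal.dump(arr))
copy.first.equal?(copy) # => true, the copy also contains itself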

Marshal does this thanks to an interesting type called the link type, referred to in marshal.c as TYPE_LINK.

TYPE_LINK example

The way Marshal does this is quite efficient. Let's look at the payload: 0408 5b06 4000. It starts with an open square bracket (5b) representing the array type, followed by the length of the array (as noted earlier, small numbers are stored in an optimized format, so 06 translates to a length of 1). The circularity is represented by an @ (40) symbol for the link type, followed by the index of the element in the encoded object that the link points to, in this case 00 for the first element (the array itself).

In short, Marshal handles circularity out of the box. That’s important to note because when we deal with this ourselves, we’re going to have to reimplement this process.

Core Type Subclasses

I mentioned earlier that there are a number of core types that Ruby implements in a special way, and that Marshal also needs to handle in a way that’s distinct from other objects. Specifically, these are: String, Regex, Array, and Hash.

One interesting edge case is what happens when you subclass one of these classes, like this:
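
For example (the name MyHash is just a stand-in):

class MyHash < Hash
end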

If you create an instance of this class, you'll see that while it looks like a hash, it's indeed an instance of the subclass:
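
For example:

hash = MyHash.new
hash[:a] = 1
hash       # => {:a=>1}
hash.class # => MyHash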

So what happens if you encode this with Marshal? If you do, you’ll find that it actually captures the correct class:
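
Continuing the example above:

Marshal.load(Marshal.dump(hash)).class # => MyHash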

Marshal does this because it has a special type called TYPE_UCLASS. To the usual data for the type (hash data in this case), TYPE_UCLASS adds the name of the class, allowing it to correctly decode the object when loading it back. It uses the same type to encode subclasses of strings, arrays, and regexes (the other core types).

The Magic of Marshal

We’ve looked at how Marshal encodes several different types of objects in Ruby. You might be wondering at this point why all this information is relevant to you.

The answer is because—whether you realize it or not—if you’re running a Rails application, you most likely rely on it. And if you decide, like we did, to take Marshal’s magic out of your application, you’ll find that it’s exactly these things that break. So before doing that, it’s a good idea to figure out how to replace each one of them.

That’s what we did, with a little help from a format called MessagePack. In the next part of this series, we’ll take a look at the steps we took to migrate our cache to MessagePack. This includes re-implementing some of the key Marshal features, such as circularity and core type subclasses, explored in this article, as well as a deep dive into our algorithm for encoding records and their associations.

Chris Salzberg is a Staff Developer on the Ruby and Rails Infra team at Shopify. He is based in Hakodate in the north of Japan.


Wherever you are, your next journey starts here! If building systems from the ground up to solve real-world problems interests you, our Engineering blog has stories about other challenges we have encountered. Intrigued? Visit our Engineering career page to find out about our open positions and learn about Digital by Design.

Apollo Cache is Your Friend, If You Get To Know It

Currently Shopify is going through the process of updating from the Apollo GraphQL client 2 to client 3. The Apollo client is a library used to query your GraphQL services from the frontend, and it has a feature for caching the objects and queries you've already fetched, which will be the focus of this post. Through the process of migrating the Apollo client we started to discuss bugs we've run into in the past, and a common thread was misuse, or more likely a misunderstanding, of the cache. This was the catalyst for me diving further into the cache and exploring how it fetches, transforms, and stores data that we previously queried. Having worked with the Apollo client for the vast majority of my career, I still couldn't say I understood exactly what was happening in the cache internally. So I felt compelled to find out. In this post, I'll focus on the Apollo client cache and the life cycle of objects that are cached within it. You'll learn:

  • What the cache is.
  • Where it gets data from.
  • What data looks like within it.
  • How it changes over time.
  • How we get rid of it, if at all.
GraphQL query that returns unexpected data for a reason we will explore in this blog

The query above isn’t returning the proper data for the metas.metaData.values object because it doesn’t return anything for the slug field. Do you see what’s going wrong here? I definitely didn’t understand this before diving into my cache research, but we’ll circle back to this in a bit after we explore the cache a little more and see if we can unearth what’s going on.

What exactly is the Apollo cache? It's an InMemoryCache where the data from your network queries is stored in memory. This means that when you leave your browser session (by closing or reloading the tab), the data in the InMemoryCache isn't persisted. The cache isn't stored in local storage or anywhere else that persists between sessions; it's only available during the session it was created in. The other thing is that it's not quite your data; it's a representation of your data (but we'll circle back to that concept in a bit).

Fetch Policies: Where Does the Data Come From?

The lifecycle of an object in the cache can be broken up into four parts: Fetching, Normalization, Updating and Merging, and finally Garbage Collection and Eviction. The first section we will dive into is fetching.

The first step to understanding the cache is knowing when we actually use data from it or retrieve data over the network. This is where fetch policies come into play. A fetch policy defines where to get data from, be it the network, the cache, or a mixture of the two. I won't get too deep into fetch policies as Apollo has a great resource on their website. If you don't explicitly set a fetch policy for your GraphQL calls (like those from useQuery), the Apollo client will default to the cache-first policy. With this policy, Apollo looks in the cache, and if all the data you requested is there, it's returned from the cache. Otherwise Apollo goes to the network, saves the new data in the cache, and returns that data to you. Understanding when to use the various fetch policies (which are different combinations of going to the network, the cache, or both) saves you considerable headaches when tracking down certain bugs.
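
As a concrete illustration, here's how a fetch policy might be set explicitly with useQuery from @apollo/client; the query and component are hypothetical.

import { gql, useQuery } from "@apollo/client";

const GET_PRODUCT = gql`
  query GetProduct($id: ID!) {
    product(id: $id) {
      id
      title
    }
  }
`;

function ProductTitle({ id }) {
  // cache-first is the default; "network-only" would skip reading from the cache.
  const { data, loading } = useQuery(GET_PRODUCT, {
    variables: { id },
    fetchPolicy: "cache-first",
  });

  if (loading) return null;
  return data.product.title;
}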

Data Normalization: How Is the Data Stored?

A flow diagram showing the lifecycle of an object in the cache broken up into 4 parts: Fetching, Normalization (highlighted in orange font in image), Updating and Merging, and Garbage Collection and Eviction
Next we move onto the second part of an object's lifecycle in the cache, Normalization.

Now that we know where we're getting our data from, we can delve into how only a representation of your data is stored. That's where normalization comes into play. Whenever query results are written to the cache, they go through a process called normalization that can be broken down into three steps.

First Step of normalization, object breakdown

The first step of the normalization process is to split your queried data into individual objects. The cache tries to split the objects up as best as possible, using ID fields as a cue for when to do so. These IDs also need to be unique, but that falls into the next step of the normalization flow.

The second step is to take each of the objects that has been broken out and assign it a globally unique cache identifier. These identifiers are normally created by appending the object's id field to its __typename field. The keyword here is normally: as you can probably imagine, some types in your graph don't have an id field but can still be identified uniquely by other fields they do have. That's where the keyFields API comes into play. It allows you to define a set of fields other than __typename and id to be used when creating cache keys. The key to a good cache key is that it's stable and reproducible, so lookups in the cache stay consistent.
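For example, a minimal (hypothetical) keyFields configuration might look like this, for a type that has no id field but can be identified by two other fields:

import { InMemoryCache } from '@apollo/client';

const cache = new InMemoryCache({
  typePolicies: {
    // Hypothetical type: no `id` field, but productId + position together
    // form a stable, reproducible identifier.
    ProductMedia: {
      keyFields: ['productId', 'position'],
    },
    // Types with a regular `id` need no configuration; they default to a
    // "__typename:id" key such as "Product:123".
  },
});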

Speaking of lookups, we come to the final step of normalization: taking those broken-out objects with their unique identifiers and putting them into a flattened data structure. The reason the Apollo cache uses a flattened data structure (essentially a hash map) is that it has the fastest lookup time for those objects. That's why the keys are so key (pun intended) to the process: they allow the cache to consistently and quickly return objects when they're looked up. This also ensures that any duplicate objects are stored in the same location in the cache, keeping it as small as possible.

Internal cache structure after normalization is finished
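You can see that flat structure yourself by calling cache.extract(). A trimmed-down, hypothetical snapshot looks roughly like this:

// Relationships are stored as { __ref } pointers into the flat map rather
// than as nested objects, so each object lives in exactly one place.
const snapshot = {
  ROOT_QUERY: {
    __typename: 'Query',
    'product({"id":"1"})': { __ref: 'Product:1' },
  },
  'Product:1': {
    __typename: 'Product',
    id: '1',
    title: 'Snowboard',
  },
};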

Automatic Updates: Merging and Adding Data into Our Cache

A flow diagram showing the lifecycle of an object in the cache can be broken up into 4 parts: Fetching , Normalization, Updating and Merging (highlighted in orange font in image), and Garbage Collection and Eviction
For an object's lifecycle in the cache we are moving onto the Updating and Merging step.

After data from our first query is stored in the cache, you may be wondering what happens when new data comes in. This is where things get a little closer to the reality of working with the cache. It's a sticking point for many (myself included): when the cache updates automatically, as it usually does, it feels like magic, but when you expect an automatic update in your UI and nothing happens, it becomes a huge frustration. So let's delve into these (not so) automatic updates that happen when we query for new data or receive data from mutation responses. Whenever we query for data (or a mutation responds with updates), and our fetch policy is one that lets us interact with the cache, one of two things happens with the new data. Its cache IDs are calculated, and each object is either found to already exist in the cache, in which case that entry is updated, or it's a new object and gets added to the cache. In a sense this is the last step in the object lifecycle: as long as the same structure is used, these objects are continually overwritten and updated with new data.

So knowing this, when the cache is automatically updating the UI, we understand that to be a merge. The following are the two situations where you can expect your data to be merged and updated in the UI automatically.

1. You are editing a single entity and returning the same type in your response

For example, you've got a product and you favorite that product. You likely fire a mutation with the product's ID, but that mutation must return the product as its return type, with the ID of the favorited product and at least the field that determines its favorite status. When this data returns, the cache calculates the internal cache ID and determines there's already an object with that ID in the cache. It then merges your incoming object (preferring the fields from the incoming object) with the one found in the cache. Finally, it broadcasts an update to any queries that had previously queried this object, and they receive the updated data, re-rendering those components.
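A hedged sketch of that first situation (the mutation and field names are hypothetical, not from the original post):

import { gql, useMutation } from '@apollo/client';

// The mutation returns the same Product type with its id and the changed
// field, which is all the cache needs to merge the update automatically.
const FAVORITE_PRODUCT = gql`
  mutation FavoriteProduct($id: ID!) {
    favoriteProduct(id: $id) {
      product {
        id
        isFavorited
      }
    }
  }
`;

function useFavoriteProduct() {
  // No update function required: the cache computes "Product:<id>", merges
  // isFavorited into the existing entry, and re-renders queries that use it.
  return useMutation(FAVORITE_PRODUCT);
}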

2. You're editing entities in a collection and returning all entries of the same type in your response

This is very similar to the first situation, except that this automatic update behavior also works with collections of objects. The only caveat is that all the objects in that collection must be returned in order for an automatic update to occur as the cache doesn’t know what your intentions are with any missing or added objects.

Now for the more frustrating part of automatic updates: when the cache won't automatically update for you. The following are the four situations you'll face.

1. Your query response data isn't related to the changes you want to happen in the cache

This one is straightforward: if you want a response to change cached data that isn't included in the response itself, you need to write an update function to make that side-effect change for you. It really comes into play when you want to change things that are related to the response data but aren't directly that data. For example, extending our favoriting scenario from before: if the favoriting action completes successfully, but you also want a "number of favorited products" count to update, that requires either an update function written for that data or a refetch of the "number of favorited products" query.
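Continuing the hypothetical favoriting example, here's a minimal sketch of such an update function, using cache.modify to bump a count the response doesn't include:

import { gql, useMutation } from '@apollo/client';

// Same hypothetical mutation as before; the response doesn't mention the
// favoritedProductsCount field at all.
const FAVORITE_PRODUCT = gql`
  mutation FavoriteProduct($id: ID!) {
    favoriteProduct(id: $id) {
      product {
        id
        isFavorited
      }
    }
  }
`;

function useFavoriteWithCount() {
  return useMutation(FAVORITE_PRODUCT, {
    update(cache) {
      // The side-effect change the cache can't infer on its own: increment
      // the cached count on the root query.
      cache.modify({
        fields: {
          favoritedProductsCount(existing: number = 0) {
            return existing + 1;
          },
        },
      });
    },
  });
}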

2. You’re unable to return an entire set of changed objects

This expands on the point above about returning entire collections: if, for example, you change multiple entities in a mutation and want those changes reflected in the UI automatically, your mutation must return the original list in its entirety, with all the objects and their corresponding IDs. This is because the cache doesn't infer what you want to do with missing objects, whether they should be removed from the list or something else. So you, as the developer, must be explicit with your return data.

3. The order of the response set is different from the currently cached one

For example, say you're changing the order of a list of todos (very original, I know). If you fire a mutation to change the order and get a response, you'll notice that the UI isn't automatically updated, even though you returned all the todos and their IDs. This is because the cache, again, doesn't infer the meaning of changes like ordering, so to reflect an order change, an update function needs to be written.

4. The response data has an added or removed item in it

This is similar to #2: the cache can't reason that an item has been added to or removed from a list unless the whole list is returned. Take the favoriting situation: if the same page has a list of favorites, and we unfavorite a product outside this list, its removal from the list isn't immediate because we likely only returned the removed object's ID. In this scenario, we also need to write an update function for that list of favorited objects to remove the object we're operating on.

… I Did Say We Would Circle Back to That Original Query

Now that we've got a handle on how normalization and automatic updates (merging) work, let's circle back to the erroneous query from the beginning and see what went wrong.

In the query above, the productMetas and metaData objects return the same type, MetaData, and in this example they both had the same ID, so the cache normalized them into a single object. The issue surfaced during that normalization process, when the cache tried to normalize the values objects on each of them into a single value. However, only one of the values objects has an id field; the other returns just a slug. The cache can't correctly normalize that second values object because it has no matching id, and so it appears to "lose" the data. The data isn't really lost, it's just not included in the normalized MetaData.values object. The solution is relatively simple: return the id for the second values object so the cache can recognize the two as the same object and merge them correctly.

Corrected query from the original issue.

In the cached object lifecycle this is essentially the end: without further interference, objects will live in your normalized cache indefinitely as you update them or add new ones. There are situations, however, where you might want to remove unused objects from your cache, especially when your application is long lived and has a lot of data flowing into it. For example, in a mapping application where you pan around a map with a bunch of points of interest on it, the points of interest you moved away from will sit in the cache but are essentially useless, taking up memory. Over time you'll notice the application get slower as the cache takes up more memory, so how can we mitigate this?

Garbage Collection: Cleaning Up After Ourselves

We have reached the final step of an object's lifecycle in the cache, Garbage Collection and Eviction.

The best way to deal with that leftover data is to use the garbage collector built into the Apollo client. In client 3 it's a new tool for clearing out the cache: a simple call to the cache.gc() method clears unreachable items from the cache and returns a list of the IDs it removed. Garbage collection isn't run automatically, however, so it's up to the developer to run this method themselves. A minimal sketch of running it is below; after that, let's explore how these unreachable items are created in the first place.
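In this sketch, the Product:1 entry is hypothetical:

import { InMemoryCache } from '@apollo/client';

const cache = new InMemoryCache();

// ...after the app has been running for a while...

// Remove every normalized object that's no longer reachable from the root.
const removedIds = cache.gc();
console.log(`Garbage collected ${removedIds.length} objects`);

// You can also evict a specific entry yourself, then collect anything that
// only it was keeping reachable.
cache.evict({ id: 'Product:1' });
cache.gc();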

Below is a sample app (available here). In this app I have a pixel representation of a Pikachu (painstakingly recreated in code by yours truly), and I'm printing out the cache to the right of it. You'll notice a counter that says "Cache Size: 212". This is a count of the top-level keys in the normalized cache, just to give a rough idea of the cache's size.

Screenshot from the demo app, depicting a pixelated Pikachu and the cached data it's using
Screenshot from the demo app, depicting a pixelated Pikachu and the cached data it's using.

Now behind this frontend application is a backend GraphQL server with a few mutations set up. All these pixels are delivered from a PixelImage query. There's also a mutation where you can send a new color to change the Pikachu's main body pixels and get the shiny version of Pikachu. So I'm going to fire that mutation and take a look at the size of the cache below:

Screenshot of Pixelated pikachu from demo with the new coloured pixels returned, showing a much larger cache
Pixelated pikachu demo with the new coloured pixels returned, showing a much larger cache

Notice that the cache is now 420 keys large. It essentially doubles in size because the pixels all have unique identifiers that changed when we changed Pikachu's colors. The new data came in after our mutation and replaced the old data, but the old pixel objects for our regular Pikachu aren't deleted. In fact, they're still rolling around in the cache; they just aren't reachable. This is how we orphan objects in our cache, by re-querying the same data with new identifiers, and this (contrived) example is why we might need the garbage collector. Let's take a look at a representation of the cache below, where the red outlines show the garbage collector traversing the tree of cached objects. On the left are our new, reachable objects: the Root is the root of our GraphQL queries, and the garbage collector can go from object to object, determining that they're reachable in the cache. On the right is our original query, which is no longer reachable from the root, and this is how the garbage collector determines that those objects should be removed from memory.

A tree structure outlining how the garbage collector visits reachable objects from the root, and removes any objects it cannot reach (ones without a red line going to them)
A tree structure outlining how the garbage collector visits reachable objects from the root, and removes any objects it cannot reach (ones without a red line going to them)

The garbage collector removing objects essentially finishes the lifecycle of an object in the cache. Thinking of every field requested from your GraphQL server as part of an object that lives and updates in the cache over time has made many of the interactions I run into in my applications much clearer. For example, whenever I query for things with IDs, I keep in mind that I may get automatic updates for those objects when I mutate state, like changing whether something is pinned or favorited, which leads to components designed around GraphQL data updates. When the GraphQL data determines state updates purely by its values, we don't end up duplicating server-side data into client-side state management, a step that often adds further complexity to our applications.

Hopefully this peeling back of the caching layers gets you thinking about how you query for objects and how you can take advantage of the free updates you get through the cache. I encourage you to take a look at the (however crude) demo applications below to see the cache updating on screen in real time as you perform different interactions, and to add the raw form of the cache to your mental model of frontend development with the Apollo client.

Demo Applications

Just fork these two projects. Once the server project has completed initialization, take the "url" it displays and update the frontend project's ApolloClient setup with that url so you can make those queries.

Raman is a senior developer at Shopify. He's had an unhealthy obsession with all things GraphQL throughout his career so far and plans to keep digging into it more. He's impatiently waiting for winter to get out snowboarding again and spends an embarrassing amount of time talking about and making food.


Wherever you are, your next journey starts here! If building systems from the ground up to solve real-world problems interests you, our Engineering blog has stories about other challenges we have encountered. Intrigued? Visit our career page to find out about our open positions and learn about Digital by Design.

Continue reading

Reducing BigQuery Costs: How We Fixed A $1 Million Query

Reducing BigQuery Costs: How We Fixed A $1 Million Query

During the infrastructural exploration of a pipeline my team was building, we discovered a query that could have cost us nearly $1 million USD a month in BigQuery. Below, we’ll detail how we reduced this and share our tips for lowering costs in BigQuery.

Processing One Billion Rows of Data

My team was responsible for building a data pipeline for a new marketing tool we were shipping to Shopify merchants. We built our pipeline with Apache Flink and launched the tool in an early release to a select group of merchants. Fun fact: this pipeline became one of the first productionized Flink pipelines at Shopify. During the early release, our pipeline ingested one billion rows of data into its internal state (managed by RocksDB) and handled streaming requests from Apache Kafka.

We wanted to take the next step by making the tool generally available to a larger group of merchants. However, this would mean a significant increase in the data our Flink pipeline would be ingesting. Remember, our pipeline was already ingesting one billion rows of data for a limited group of merchants. Ingesting an ever-growing dataset wouldn’t be sustainable. 

As a solution, we looked into a SQL-based external data warehouse. We needed something that our Flink pipeline could submit queries to and that could write back results to Google Cloud Storage (GCS). By doing this, we could simplify the current Flink pipeline dramatically by removing ingestion, ensuring we have a higher throughput for our general availability launch.

The external data warehouse needed to meet the following three criteria:

  1. Atomically load the parquet dataset easily
  2. Handle 60 requests per minute (our general availability estimation) without significant queuing or waiting time
  3. Export the parquet dataset to GCS easily

The first query engine that came to mind was BigQuery. It’s a data warehouse that can both store petabytes of data and query those datasets within seconds. BigQuery is fully managed by Google Cloud Platform and was already in use at Shopify. We knew we could load our one billion row dataset into BigQuery and export query results into GCS easily. With all of this in mind, we started the exploration but we met an unexpected obstacle: cost.

A Single Query Would Cost Nearly $1 Million

As mentioned above, we've used BigQuery at Shopify before, so there was an existing BigQuery loader in our internal data modeling tool and we easily loaded our large dataset into BigQuery. However, when we first ran the query, the log showed the following:

total bytes processed: 75462743846, total bytes billed: 75462868992

That roughly translated to 75 GB billed for the query. This immediately raised an alarm because BigQuery charges by the data processed per query. If each query were to scan 75 GB of data, how much would it cost us at our general availability launch?

I quickly did some rough math. If we estimate 60 RPM at launch, then:

60 RPM x 60 minutes/hour x 24 hours/day x 30 days/month = 2,592,000 queries/month 

If each query scans 75 GB of data, then we're looking at approximately 194,400,000 GB of data scanned per month. At BigQuery's on-demand pricing at the time (about $5 per TiB scanned), that works out to $949,218.75 USD per month!

Clustering to the Rescue

With the estimation above, we immediately started to look for solutions to reduce this monstrous cost. 

We knew that clustering our tables could help reduce the amount of data scanned in BigQuery. As a reminder, clustering is the act of sorting your data based on one or more columns in your table. You can cluster a table on columns of types like DATE, GEOGRAPHY, and TIMESTAMP, and BigQuery can then limit scans to only the blocks relevant to filters on those columns.

With clustering in mind, we went digging and discovered several condition clauses in the query that we could cluster on. These were ideal because if we clustered our table on columns appearing in WHERE clauses, the filters in our query would ensure only the matching data is scanned. The query engine stops scanning once it has found the data satisfying those conditions, so only the relevant portion of the table is read instead of the entire thing. This reduces the number of bytes scanned and saves a lot of processing time.
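As a rough sketch of what that can look like with the @google-cloud/bigquery Node client (the project, dataset, table, and column names here are made up, and this isn't the exact query from our pipeline):

import { BigQuery } from '@google-cloud/bigquery';

const bigquery = new BigQuery();

// Create a clustered copy of a table. CLUSTER BY sorts the stored data by
// these columns so queries that filter on them scan far fewer bytes.
async function createClusteredCopy() {
  await bigquery.query(`
    CREATE TABLE \`my_project.marketing.events_clustered\`
    CLUSTER BY shop_id, event_type AS
    SELECT * FROM \`my_project.marketing.events\`
  `);
}

// Filtering on the clustered columns lets BigQuery prune blocks instead of
// scanning the whole table, which is what drives bytes billed down.
async function purchasesForShop(shopId: number) {
  const [rows] = await bigquery.query({
    query: `
      SELECT COUNT(*) AS purchases
      FROM \`my_project.marketing.events_clustered\`
      WHERE shop_id = @shopId AND event_type = 'purchase'
    `,
    params: { shopId },
  });
  return rows;
}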

We created a clustered dataset on two feature columns from the query's WHERE clause. We then ran the exact same query, and the log now showed 508.1 MB billed. That's roughly 150 times less data billed than with the previous unclustered table.

With our newly clustered table, we identified that the query would now only scan 108.3 MB of data. Doing some rough math again:

2,592,000 queries/month x 0.1083 GB of data ≈ 280,714 GB of data scanned/month

That would bring our cost down to approximately $1,370.67 USD per month, which is way more reasonable.

Other Tips for Reducing Cost

While all it took was some clustering for us to significantly reduce our costs, here are a few other tips for lowering BigQuery costs:

  • Avoid SELECT * statements: Only select the columns you actually need. This limits the engine's scan to just those columns, limiting your cost.
  • Partition your tables: This is another way to restrict the data scanned by dividing your table into segments (aka partitions). You can create partitions in BigQuery based on time units, ingestion time, or an integer range.
  • Don’t run queries to explore or preview data: Doing this would be an unnecessary cost. You can use table preview options to view data for free.

And there you have it. If you’re working with a high volume of data and using BigQuery, following these tips can help you save big. Beyond cost savings, this is critical for helping you scale your data architecture. 

Calvin is a senior developer at Shopify. He enjoys tackling hard and challenging problems, especially in the data world. He’s now working with the Return on Ads Spend group in Shopify. In his spare time, he loves running, hiking and wandering in nature. He is also an amateur Go player.


Are you passionate about solving data problems and eager to learn more about Shopify? Check out openings on our careers page.

Continue reading

The Management Poles of Developer Infrastructure Teams

The Management Poles of Developer Infrastructure Teams

Over the past few years, as I’ve been managing multiple developer infrastructure teams at once, I’ve found some tensions that are hard to resolve. In my current mental model, I have found that there are three poles that have a natural tension and are thus tricky to balance: management support, system and domain expertise, and road maps. I’m going to discuss the details of these poles and some strategies I’ve tried to manage them.

What’s Special About Developer Infrastructure Teams?

Although this model likely can apply to any software development team, the nature of developer infrastructure (Dev Infra) makes this situation particularly acute for managers in our field. These are some of the specific challenges faced in Dev Infra:

  • Engineering managers have a lot on their plates. For whatever reason, infrastructure teams usually lack dedicated product managers, so we often have to step in to fill that gap. Similarly, we’re responsible for tasks that usually fall to UX experts, such as doing user research.
  • There’s a lot of maintenance and support. Teams are responsible for keeping multiple critical systems online with hundreds or thousands of users, usually with only six to eight developers. In addition, we often get a lot of support requests, which is part of the cost of developing in-house software that has no extended community outside the company.
  • As teams tend to organize around particular phases in the development workflow, or sometimes specific technologies, there’s a high degree of domain expertise that’s developed over time by all its members. This expertise allows the team to improve their systems and informs the team’s road map.

What Are The Three Poles?

The Dev Infra management poles I’ve modelled are tensions, much like that between product and engineering. They can’t, I don’t believe, all be solved at the same time—and perhaps they shouldn’t be. We, Dev Infra managers, balance them according to current needs and context and adapt as necessary. For this balancing act, it behooves us to make sure we understand the nature of these poles.

1. Management Support

Supporting developers in their career growth is an important function of any engineering manager. Direct involvement in team projects allows the tightest feedback loops between manager and report, and thus the highest-quality coaching and mentorship. We also want to maximize the number of reports per manager. Good managers are hard to find, and even the best manager adds a bit of overhead to a team’s impact.

We want the manager to be as involved in their reports’ work as possible, and we want the highest number of reports per manager that they can handle. Where this gets complicated is balancing the scope and domain of individual Dev Infra teams and of the whole Dev Infra organization. This tension is a direct result of the need for specific system and domain expertise on Dev Infra teams.

2. System and Domain Expertise

As mentioned above, in Dev Infra we tend to build teams around domains that represent phases in the development workflow, or occasionally around specific critical technologies. It’s important that each team has both domain knowledge and expertise in the specific systems involved. Despite this focus, the scope of and opportunities in a given area can be quite broad, and the associated systems grow in size and complexity.

Expertise in a team’s systems is crucial just to keep everything humming along. As with any long-running software application, dependencies need to be managed, underlying infrastructure has to be occasionally migrated, and incidents must be investigated and root causes solved. Furthermore, at any large organization, Dev Infra services can have many users relative to the size of the teams responsible for them. Some teams will require on-call schedules in case a critical system breaks during an emergency (finding out the deployment tool is down when you’re trying to ship a security fix is, let’s just say, not a great experience for anyone).

A larger team means less individual on-call time and more hands for support, maintenance, and project work. As teams expand their domain knowledge, more opportunities are discovered for increasing the impact of the team’s services. The team will naturally be driven to constantly improve the developer experience in their area of expertise. This drive, however, risks a disconnect with the greatest opportunities for impact across Dev Infra as a whole.

3. Road Maps

Specializing Dev Infra teams in particular domains is crucial for both maintenance and future investments. Team road maps and visions improve and expand upon existing offerings: smoothing interfaces, expanding functionality, scaling up existing solutions, and looking for new opportunities to impact development in their domain. They can make a big difference to developers during particular phases of their workflow like providing automation and feedback while writing code, speeding up continuous integration (CI) execution, avoiding deployment backlogs, and monitoring services more effectively.

As a whole Dev Infra department, however, the biggest impact we can have on development at any given time changes. When Dev Infra teams are first created, there’s usually a lot of low-hanging fruit—obvious friction at different points in the development workflow—so multiple teams can broadly improve the developer experience in parallel. At some point, however, some aspects of the workflow will be much smoother than others. Maybe CI times have finally dropped to five minutes. Maybe deploys rarely need attention after being initiated. At a large organization, there will always be edge cases, bugs, and special requirements in every area, but their impact will be increasingly limited when compared to the needs of the engineering department as a whole.

At this point, there may be an opportunity for a large new initiative that will radically impact development in a particular way. There may be a few, but it’s unlikely that there will be the need for radical changes across all domains. Furthermore, there may be unexplored opportunities and domains for which no team has been assembled. These can be hard to spot if the majority of developers and managers are focused on existing well-defined scopes.

How to Maintain the Balancing Act

Here’s the part where I confess that I don’t have a single amazing solution to balance management support, system maintenance and expertise, and high-level goals. Likely there are a variety of solutions that can be applied and none are perfect. Here are three ideas I’ve thought about and experimented with.

1. Temporarily Assign People from One Team to a Project on Another

If leadership has decided that the best impact for our organization at this moment is concentrated in the work of a particular team, call it Team A, and if Team A’s manager can’t effectively handle any more reports, then a direct way to get more stuff done is to take a few people from another team (Team B) and assign them to Team A’s projects. This has some other benefits as well: it increases the number of people with familiarity in Team A’s systems, and people sometimes like to change up what they’re working on.

When we tried this, the immediate question was “should the people on loan to Team A stay on the support rotations for their ‘home’ team?” From a technical expertise view, they’re important to keep the lights on in the systems they’re familiar with. Leaving them on such rotations prevents total focus on Team A, however, and at a minimum extends the onboarding time. There are a few factors to consider: the length of the project(s), the size of Team B, and the existing maintenance burden on Team B. Favour removing the reassigned people from their home rotations, but know that this will slow down Team A’s work even more as they pick up the extra work.

The other problem we ran into is that the manager of Team B is disconnected from the work their reassigned reports are now working on. Because the main problem is that Team A’s manager doesn’t have enough bandwidth to have more reports, there’s less management support for the people on loan, in terms of mentoring, performance management, and prioritization. The individual contributor (IC) can end up feeling disconnected from both their home team and the new one.

2. Have a Whole Team Contribute to Another Team’s Goals

We can mitigate at least the problem of ICs feeling isolated in their new team if we have the entire team (continuing the above nomenclature, Team B) work on the systems that another team (Team A) owns. This allows members of Team B to leverage their existing working relationships with each other, and Team B’s manager doesn’t have to split their attention between two teams. This arrangement can work well if there is a focused project in Team A’s domain that somehow involves some of Team B’s domain expertise.

This is, of course, a very blunt instrument, in that no project work will get done on Team B’s systems, which themselves still need to be maintained. There’s also a risk of demotivating the members of Team B, who may feel that their domain and systems aren’t important, although this can be mitigated to some extent if the project benefits or requires their domain expertise. We’ve had success here in exactly that way in an ongoing project done by our Test Infrastructure team to add data from our CI systems to Services DB, our application-catalog app stewarded by another team, Production Excellence. Their domain expertise allowed them to understand how to expose the data in the most intuitive and useful way, and they were able to more rapidly learn Services DB’s codebase by working together.

3. Tiger Team

A third option we’ve tried out in Dev Infra is a tiger team: “a specialized, cross-functional team brought together to solve or investigate a specific problem or critical issue.” People from multiple teams form a new, temporary team for a single project, often prototyping a new idea. Usually the team operates in a fast-paced, autonomous way towards a very specific goal, so management oversight is fairly limited. By definition, most people on a tiger team don’t usually work together, so the home and new team dichotomy is sidestepped, or at least very deliberately managed. The focus of the team means that members put aside maintenance, support, and other duties from their home team for the duration of the team’s existence.

The very first proof of concept for Spin was built this way over about a month. At that time, the value was sufficiently clear that we then formed a whole team around Spin and staffed it up to tackle the challenge of turning it into a proper product. We’ve learned a lot since then, but that first prototype was crucial in getting the whole project off the ground!

No Perfect Solutions

From thinking about and experimenting with team structures during my decade of management experience, there doesn’t seem to be a perfect solution to balance the three poles of management support, system maintenance and domain expertise, and high-level goals. Each situation is unique, and trade-offs have to be judged and taken deliberately. I would love to hear other stories of such balancing acts! Find me on Twitter and LinkedIn.

Mark Côté is the Director of Engineering, Developer Infrastructure, at Shopify. He's been in the developer-experience space for over a decade and loves thinking about infrastructure-as-product and using mental models in his strategies.


Wherever you are, your next journey starts here! If building systems from the ground up to solve real-world problems interests you, our Engineering blog has stories about other challenges we have encountered. Intrigued? Visit our Engineering career page to find out about our open positions and learn about Digital by Design.

Continue reading

Hubble: Our Tool for Encapsulating and Extending Security Tools

Hubble: Our Tool for Encapsulating and Extending Security Tools

Fundamentally, Shopify is a company that thrives by building simplicity. We take hard, risky, and complex things and make them easy, safe, and simple.

Trust is Shopify’s team responsible for making commerce secure for everyone. First and foremost, that means securing our internal systems and IT resources, and maintaining a strong cybersecurity posture. If you’ve worked in these spaces before, you know that it takes a laundry list of tools to effectively manage and secure a large fleet of computers. Not only does it take tons of tools, but it also takes training, access provisioning and deprovisioning, and constant patching. In any large or growing company, these problems compound and can become exponential costs if they aren’t controlled and solved for.

You either pay that cost by spending countless human hours on menial processes and task switching, or you accept the risk of shadow IT—employees developing their own processes and workarounds rather than following best practices. You either get choked by bureaucracy, or you create such a low trust environment that people don’t feel their company is interested in solving their problems.

Shopify is a global company that, in 2020, embraced being Digital by Design—in essence, the firm belief that our people have the greatest impact when we support them to work whenever and wherever they like. As you can imagine, this only magnified the problems described above. With the end of office centricity, suddenly the work of securing our devices got a lot more important, and a lot more difficult. Network environments got more varied, the possibility of in-person patching or remediation went out the window—the list goes on. Faced with these challenges, we searched for off-the-shelf solutions, but couldn’t find anything that fully fit our needs.

Hubble logo which features an image of a telescope followed by the word Hubble
Hubble Logo

So, We Built Hubble.

An evolution of previous internal solutions, Hubble is a tool that encapsulates and extends many of the common tools used in security. Mobile device management services and more are all fully integrated into Hubble. For IT staff, Hubble is a one stop shop for inventory, device management, and security. Rather than granting hundreds of employees access to multiple admin panels, they access Hubble—which ingests and standardizes data from other systems, and then sends commands back to those systems. We also specify levels of granularity in access (a specialist might have more access than an entry level worker, for instance). On the back end, we also track and audit access in one central location with a consistent set of fields—making incident response and investigation less of a rabbit hole.

A screenshot of the Hubble screen on a Macbook Pro. It shows a profile picture and several lines of status updates about the machine
Hubble’s status screen on a user’s machine

For everyone else at Shopify, Hubble is a tool to manage and view the devices that belong to them. At a glance, they can review the health of their device and its compliance, not against an arbitrary set of metrics, but against things we define and find valuable: OS and patch compliance, VPN usage, and more. Folks don’t need to ask IT or just wonder if their device is secure. Hubble informs them, either via the website or device notification pings. And if their device isn’t secure, Hubble provides them with actionable information on how to fix it. Users can also specify test devices, or opt in to betas that we run. This enables us to easily build beta cohorts for any testing we might be running. When you give people the tools to be proactive about their security, and show that you support that proactivity, you help build a culture of ownership.

And, perhaps most importantly, Hubble is a single source of truth for all the data it consumes. This makes it easier for other teams to develop automations and security processes. They don’t have to worry about standardizing data, or making calls to 100 different services. They can access Hubble, and trust that the data is reliable and standardized.

Now, why should you care about this? Hubble is an internal tool for Shopify, and unfortunately it isn’t open source at this time. But these two lessons we learned building and realizing Hubble are valuable and applicable anywhere.

1. When the conversation is centered on encapsulation, the result is a partnership in creating a thoughtful and comprehensive solution.

Building and maintaining Hubble requires a lot of teams talking to each other. Developers talk to support staff, security engineers, and compliance managers. While these folks often work near each other, they rarely work directly together. This kind of collaboration is super valuable and can help you identify a lot of opportunities for automation and development. Plus, it presents the opportunity for team members to expand their skills, and maybe have an idea of what their next role could be. Even if you don’t plan to build a tool like this, consider involving frontline staff with the design and engineering processes in your organization. They bring valuable context to the table, and can help surface the real problems that your organization faces.

2. It’s worth fighting for investment.

IT and Cybersecurity are often reactive and ad-hoc driven teams. In the worst cases, this field lends itself to unhealthy cultures and an erratic work life balance. Incident response teams and frontline support staff often have unmanageable workloads and expectations, in large part due to outdated tooling and processes. We strive to make sure it isn’t like that at Shopify, and it doesn’t have to be that way where you work. We’ve been able to use Hubble as a platform for identifying automation opportunities. By having engineering teams connected to support staff via Hubble, we encourage a culture of proactivity. Teams don’t just accept processes as broken and outdated—they know that there’s talent and resources available for solving problems and making things better. Beyond culture and work life balance, consider the financial benefits and risk-minimization that this strategy realizes.

For each new employee onboarded to your IT or Cybersecurity teams, you spend weeks if not months helping them ramp up and safely access systems. This can incur certification and training costs (which can easily run in the thousands of dollars per employee if you pay for their certifications), and a more difficult job search to find the right candidate. Then you take on the risk of all these people having direct access to sensitive systems. And finally, you take on the audit and tracking burden of all of this.

With each tool you add to your environment, you increase complexity exponentially. But there’s a reason those tools exist, and complexity on its own isn’t a good enough reason to reject a tool. This is a field where costs want to grow exponentially. It seems like the default is to either accept that cost and the administrative overhead it brings, or ignore the cost and just eat the risk. It doesn’t have to be that way.

We chose to invest and to build Hubble to solve these problems at Shopify. Encapsulation can keep you secure while keeping everyone sane at the same time.

Tony is a Senior Engineering Program Manager and leads a team focussed on automation and internal support technologies. He’s journaled daily for more than 9 years, and uses it as a fun corpus for natural language analysis. He likes finding old bread recipes and seeing how baking has evolved over time!


Wherever you are, your next journey starts here! If building systems from the ground up to solve real-world problems interests you, our Engineering blog has stories about other challenges we have encountered. Intrigued? Visit our Engineering career page to find out about our open positions and learn about Digital by Design.

Continue reading

How to Structure Your Data Team for Maximum Influence

How to Structure Your Data Team for Maximum Influence

One of the biggest challenges most managers face (in any industry) is trying to assign their reports work in an efficient and effective way. But as data science leaders—especially those in an embedded model—we’re often faced with managing teams with responsibilities that traverse multiple areas of a business. This juggling act often involves different streams of work, areas of specialization, and stakeholders. For instance, my team serves five product areas, plus two business areas. Without a strategy for dealing with these stakeholders and related areas of work, we risk operational inefficiency and chaotic outcomes. 

There are many frameworks out there that suggest the most optimal way to structure a team for success. Below, we’ll review these frameworks and their positives and negatives when applied to a data science team. We’ll also share the framework that’s worked best for empowering our data science teams to drive impact.

An example of the number of product and business areas my team supports at Shopify
An example of the number of product and business areas my data team supports at Shopify

First, Some Guiding Principles

Before looking at frameworks for managing these complex team structures, I’ll first describe some effective guiding principles we should use when organizing workflows and teams:  

  1. Efficiency: Any structure must provide an ability to get work done in an efficient and effective manner. 
  2. Influence: Structures must be created in such a way that your data science team continues to have influence on business and product strategies. Data scientists often have input that is critical to business and product success, and we want to create an environment where that input can be given and received.
  3. Stakeholder clarity: We need to create a structure where stakeholders clearly know who to contact to get work done, and seek help and advice from.
  4. Stability: Some team structures can create instability for reports, which leads to a whole host of other problems.
  5. Growth: If we create structures where reports only deal with stakeholders and reactive issues, it may be difficult for them to develop professionally. We want to ensure reports have time to tackle work that enables them to acquire a depth of knowledge in specific areas.
  6. Flexibility: Life happens. People quit, need change, or move on. Our team structures need to be able to deal with and recognize that change is inevitable. 

Traditional Frameworks for Organizing Data Teams

Alright, now let’s look at some of the more popular frameworks used to organize data teams. While they’re not the only ways to structure teams and align work, these frameworks cover most of the major aspects in organizational strategy. 

Swim Lanes

You’ve likely heard of this framework before, and maybe even cringed when someone has told you or your report to "stay in your swim lanes". This framework involves assigning someone to very strictly defined areas of responsibility. Looking at the product and business areas my own team supports as an example, we have seven different groups to support. According to the swim lane framework, I would assign one data scientist to each group. With an assigned product or business group, their work would never cross lanes. 

In this framework, there's little expected help or cross-training that occurs, and everyone is allowed to operate with their own fiefdom. I once worked in an environment like this. We were a group of tenured data scientists who didn’t really know what the others were doing. It worked for a while, but when change occurred (new projects, resignations, retirements) it all seemed to fall apart.  

Let’s look at this framework’s advantages: 

  • Distinct areas of responsibility. In this framework, everyone has their own area of responsibility. As a manager, I know exactly who to assign work to and where certain tasks should go on our board. I can be somewhat removed from the process of workload balancing.  
  • High levels of individual ownership. Reports own an area of responsibility and have a stake in its success. They also know that their reputation and job are on the line for the success or failure of that area.
  • The point-of-contact is obvious to stakeholders. Ownership is very clear to stakeholders, so they always know who to go to. This model also fosters long-term relationships. 

And the disadvantages:

  • Lack of cross-training. Individual reports will have very little knowledge of the work or codebase of their peers. This becomes an issue when life happens and we need to react to change.
  • Reports can be left on an island. Reports can be left alone which tends to matter more when times are tough. This is a problem for both new reports who are trying to onboard and learn new systems, but also for tenured reports who may suddenly endure a higher workload. Help may not be coming.  
  • Fails under high-change environments. For the reasons mentioned above, this system fails under high-change environments. It also creates a team-level rigidity that means when general organizational changes happen, it’s difficult to react and pivot.

Referring back to our guiding principles when considering how to effectively organize a data team, this framework hits our stakeholder clarity and efficiency principles, but only in stable environments. Swim lanes often fail in conditions of change or when the team needs to pivot to new responsibilities—something most teams should expect.

Stochastic Process

As data scientists, we’re often educated in the stochastic process and this framework resembles this theory. As a refresher, the stochastic process is defined by randomness of assignment, where expected behavior is near random assignments to areas or categories.  

Likewise, in this framework each report takes the next project that pops up, resembling a random assignment of work. However, projects are prioritized and when an employee finishes one project, they take on the next, highest priority project. 

This may sound overly random as a system, but I’ve worked on a team like this before. We were a newly setup team, and no one had any specific experience with any of the work we were doing. The system worked well for about six months, but over the course of a year, we felt like we'd been put through the wringer and as though no one had any deep knowledge of what we were working on.  

The advantages of this framework are:

  • High levels of team collaboration. Everyone is constantly working on each other’s code and projects, so a high-level of collaboration tends to develop.
  • Reports feel like there is always help. Since work is assigned on a next-priority-gets-the-next-available-person basis, if someone is struggling with a high-priority task, they can simply ask the next available teammate for help.
  • Extremely flexible under high levels of change. Your organization decides to reorg to align to new areas of the business? No problem! You weren’t aligned to any specific groups of stakeholders to begin with. Someone quits? Again, no problem. Just hire someone new and get them into the rotation.

And the disadvantages:

  • Can feel like whiplash. As reports are asked to move constantly from one unrelated project to the next, they can develop feelings of instability and uncertainty (aka whiplash). Additionally, as stakeholders work with a new resource on each project, this can limit the ability to develop rapport.
  • Inability to go deep on specialized subject matters. It’s often advantageous for data scientists to dive deep into one area of the business or product. This enables them to develop deep subject area knowledge in order to build better models. If we’re expecting them to move from project to project, this is unlikely to occur.
  • Extremely high management inputs. As data scientists become more like cogs in a wheel in this type of framework, management ends up owning most stakeholder relationships and business knowledge. This increases demands on individual managers.

Looking at the advantages and disadvantages of this framework, and measuring them against our guiding principles, this framework only hits two of our principles: flexibility and efficiency. While this framework can work in very specific circumstances (like brand new teams), the lack of stakeholder clarity, relationship building, and growth opportunity will result in the failure of this framework to sufficiently serve the needs of the team and stakeholders. 

A New Framework: Diamond Defense 

Luckily, we’ve created a third way to organize data teams and work. I like to compare this framework to the concept of diamond defense in basketball. In diamond defense, players have general areas (zones) of responsibility. However, once play starts, the defense focuses on trapping (sending extra resources) to the toughest problems, while helping out areas in the defense that might be left with fewer resources than needed.  

This same defense method can be used to structure data teams to be highly effective. In this framework, you loosely assign reports to your product or business areas, but make sure to rotate resources to the toughest projects and wherever help is needed.

Referring back to the product and business areas my team supports, you can see how I use this framework to organize my team: 

An example of how I use the diamond defense framework to structure my data team and align them to zones of work
An example of how I use the diamond defense framework to structure my data team

Each data scientist is assigned to a zone. I then aligned our additional business areas (Finance and Marketing) to a product group, and assigned resources to these groupings. Finance and Marketing are aligned differently here because they aren’t supported by a team of software engineers. Instead, I aligned them to the product group that most closely resembles their work in terms of data accessed and models built. Currently, Marketing has the highest number of requests for our team, so I added more resources to support this group.

You’ll notice on the chart that I keep myself and an additional data scientist in a bullpen. This is key to the diamond defense as it ensures we always have additional resources to help out where needed. Let’s dive into some examples of how we may use resources in the bullpen: 

  1. DS2 is under-utilized. We simultaneously find out that DS1 is overwhelmed by the work of their product area, so we tap DS2 to help out. 
  2. SR DS1 quits. In this case, we rotate DS4 into their place, and proceed to hire a backfill. 
  3. SR DS2 takes a leave of absence. In this situation, I as the manager slide in to manage SR DS2’s stakeholders. I would then tap DS4 to help out, while the intern who is also assigned to the same area continues to focus on getting their work done with help from DS4. 

This framework has several advantages:

  • Everyone has dedicated areas to cover and specialize in. As each report is loosely assigned to a zone (specific product or business area), they can go deep and develop specialized skills.  
  • Able to quickly jump on problems that pop up. Loose assignment to zones enable teams the flexibility to move resources to the highest-priority areas or toughest problems.  
  • Reports can get the help they need. If a report is struggling with the workload, you can immediately send more resources towards that person to lighten their load.  

And the disadvantages:

  • Over-rotation. In certain high-change circumstances, a situation can develop where data scientists spend most of their time covering for other people. This can create very volatile and high-risk situations, including turnover.

This framework hits all of our guiding principles. It provides the flexibility and stability needed when dealing with change, it enables teams to efficiently tackle problems, focus areas enable report growth and stakeholder clarity, and the relationships between reports and their stakeholders improve the team's ability to influence policies and outcomes.

Conclusion

There are many ways to organize data teams to different business or product areas, stakeholders, and bodies of work. While the traditional frameworks we discussed above can work in the short-term, they tend to over-focus either on rigid areas of responsibility or everyone being able to take on any project. 

If you use one of these frameworks and you’re noticing that your team isn’t working as effectively as you know they can, give our diamond defense framework a try. This hybrid framework addresses all the gaps of the traditional frameworks, and ensures:

  • Reports have focus areas and growth opportunity 
  • Stakeholders have clarity on who to go to
  • Resources are available to handle any change
  • Your data team is set up for long-term success and impact

Every business and team is different, so we encourage you to play around with this framework and identify how you can make it work for your team. Just remember to reference our guiding principles for complex team structures.

Levi manages the Banking and Accounting data team at Shopify. He enjoys finding elegant solutions to real-world business problems using math, machine learning, and elegant data models. In his spare time he enjoys running, spending time with his wife and daughters, and farming. Levi can be reached via LinkedIn.

Are you passionate about solving data problems and eager to learn more about Shopify? Check out openings on our careers page.

Continue reading

Finding Relationships Between Ruby’s Top 100 Packages and Their Dependencies

Finding Relationships Between Ruby’s Top 100 Packages and Their Dependencies

In June of this year, RubyGems, the main repository for Ruby packages (gems), announced that multi-factor authentication (MFA) was going to be gradually rolled out to users. This means that users eventually will need to login with a one-time password from their authenticator device, which will drastically reduce account takeovers.

The team I'm interning on, the Ruby Dependency Security team at Shopify, played a big part in rolling out MFA to RubyGems users. The team’s mission is to increase the security of the Ruby software supply chain, so increasing MFA usage is something we wanted to help implement.

A large Ruby with stick arms and leg pats a little Ruby with stick arms and legs
Illustration by Kevin Lin

One interesting decision that the RubyGems team faced is determining who was included in the first milestone. The team wanted to include at least the top 100 RubyGems packages, but also wanted to prevent packages (and people) from falling out of this cohort in the future.

To meet those criteria, the team set a threshold of 180 million downloads for the gems instead. Once a gem crosses 180 million downloads, its owners are required to use multi-factor authentication in the future.

Bar graph showing gem download numbers for Gem 1 and Gem 2
Gem downloads represented as bars. Gem 2 is over the 180M download threshold, so its owners would need MFA.

This design decision led me to a curiosity. As packages frequently depend on other packages, could some of these big (more than 180M downloads) packages depend on small (less than 180M downloads) packages? If this was the case, then there would be a small loophole: if a hacker wanted to maximize their reach in the Ruby ecosystem, they could target one of these small packages (which would get installed every time someone installed one of the big packages), circumventing the MFA protection of the big packages.

On the surface, it might not make sense that a dependency would ever have fewer downloads than its parent. After all, every time the parent gets downloaded, the dependency does too, so surely the dependency has at least as many downloads as the parent, right?

Screenshot of a Slack conversation between coworkers discussing one's scepticism about finding exceptions
My coworker Jacques, doubting that big gems will rely on small gems. He tells me he finds this hilarious in retrospect.

Well, I thought I should try to find exceptions anyway, and given that this blog post exists, it would seem that I found some. Here’s how I did it.

The Investigation

The first step in determining if big packages depended on small packages was to get a list of big packages. The rubygems.org stats page shows the top 100 gems in terms of downloads, but the last gem on page 10 has 199 million downloads, meaning that scraping these pages would yield an incomplete list, since the threshold I was interested in is 180 million downloads.

A screenshot of a page of Rubygems.org statistics
Page 10 of https://rubygems.org/stats, just a bit above the MFA download threshold

To get a complete list, I instead turned to using the data dumps that rubygems.org makes available. Basically, the site takes a daily snapshot of the rubygems.org database, removes any confidential information, and then publishes it. Their repo has a convenient script that allows you to load these data dumps into your own local rubygems.org database, and therefore run queries on the data using the Rails console. It took me many tries to make a query that got all the big packages, but I eventually found one that worked:

Rubygem.joins(:gem_download).where(gem_download: {count: 180_000_000..}).map(&:name)

I now had a list of 112 big gems, and I had to find their dependencies. The first method I tried was using the rubygems.org API. As described in the documentation, you can give the API the name of a gem and it’ll give you the names of all of its dependencies as part of the response payload. The same endpoint also tells you how many downloads a gem has, so the path was clear: for each big gem, get a list of its dependencies and find out if any of them had fewer downloads than the threshold.

Here are the functions that get the dependencies and downloads:

Ruby function that gets a list of dependencies as reported by the rubygems.org API. Requires built-in uri, net/http, and json packages.
Ruby function that gets downloads from the same rubygems.org API endpoint. It also has a branch to check the download count for specific versions of gems, which I used later.
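
The original post shows these functions as screenshots. A minimal reconstruction of what the captions describe might look like this (method names and the exact response fields are my assumptions about the public rubygems.org API, not the original code):

require "uri"
require "net/http"
require "json"

RUBYGEMS_API = "https://rubygems.org/api/v1"

# Returns the names of a gem's runtime dependencies as reported by the API.
def dependencies(gem_name)
  body = Net::HTTP.get(URI("#{RUBYGEMS_API}/gems/#{gem_name}.json"))
  JSON.parse(body)["dependencies"]["runtime"].map { |dep| dep["name"] }
end

# Returns total downloads for a gem, or the downloads of a specific version
# when one is given (the branch mentioned in the caption above).
def downloads(gem_name, version = nil)
  if version
    body = Net::HTTP.get(URI("#{RUBYGEMS_API}/downloads/#{gem_name}-#{version}.json"))
    JSON.parse(body)["version_downloads"]
  else
    body = Net::HTTP.get(URI("#{RUBYGEMS_API}/gems/#{gem_name}.json"))
    JSON.parse(body)["downloads"]
  end
end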

Putting all of this together, I found that 13 out of the 112 big gems had small gems as dependencies. Exceptions! So why did these small gems have fewer downloads than their parents? I learned that it was mainly due to two reasons:

  1. Some gems are newer than their parents, that is, a new gem came out and a big gem developer wanted to add it as a dependency.
  2. Some gems are shipped with Ruby by default, so they don’t need to be downloaded and thus have low(er) download count (for example, racc and rexml).

With this, I now had proof of the existence of big gems that would be indirectly vulnerable to account takeover of a small gem. While an existence proof is nice, it was pointed out to me that the rubygems.org API only returns a gem’s direct dependencies, and that those dependencies might have sub-dependencies that I wasn’t checking. So how could I find out which packages get installed when one of these big gems gets installed?

With Bundler, of course!

Bundler is the Ruby dependency manager software that most Ruby users are probably familiar with. Bundler takes a list of gems to install (the Gemfile), installs dependencies that satisfy all version requirements, and, crucially for us, makes a list of all those dependencies and versions in a Gemfile.lock file. So, to find out which big gems relied in any way on small gems, I programmatically created a Gemfile with only the big gem in it, programmatically ran bundle lock, and programmatically read the Gemfile.lock that was created to get all the dependencies.

Here’s the function that did all the work with Bundler:

Ruby function that gets all dependencies that get installed when one gem is installed using Bundler
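
Again, the original function lives in a screenshot; a rough sketch of the Gemfile-then-bundle-lock approach could look like this (the method name and temp-directory handling are my own):

require "bundler"
require "tmpdir"

# Builds a throwaway Gemfile containing only gem_name, resolves it with
# `bundle lock`, and returns every gem listed in the resulting Gemfile.lock.
def all_installed_dependencies(gem_name)
  Dir.mktmpdir do |dir|
    File.write(File.join(dir, "Gemfile"), <<~GEMFILE)
      source "https://rubygems.org"
      gem "#{gem_name}"
    GEMFILE
    system("bundle lock", chdir: dir, out: File::NULL, err: File::NULL)
    lockfile = Bundler::LockfileParser.new(File.read(File.join(dir, "Gemfile.lock")))
    lockfile.specs.map(&:name) - [gem_name]
  end
end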

With this new methodology, I found that 24 of the 112 big gems rely on small gems, which is a fairly significant proportion of them. After discovering this, I wanted to look into visualization. Up until this point, I was just printing out results to the command line to make text dumps like this:

Text dump of dependency results. Big gems are red, their dependencies that are small are indented in black

This visualization isn’t very convenient to read, and it obscures patterns. For example, as you can see above, many big gems rely on racc. It would be useful to know if they relied directly on it, or if most packages depended on it indirectly through some other package. The idea of making a graph had been in the back of my mind since the beginning of this project, and when I realized how helpful it might be, I committed to it. I used the graph gem, following some examples from this talk by Aja Hammerly. I used a breadth-first search, starting with a queue of all the big gems, adding direct dependencies to the queue as I went. I added edges from gems to their dependencies and highlighted small gems in red. Here was the first iteration:

The output of the graph gem that highlights gem dependencies
The first iteration
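
For reference, the edge-collection step behind a graph like this can be sketched as a breadth-first walk over the big gems (names here are illustrative; the actual rendering used the graph gem as described above):

# Collects [gem, dependency] edges by walking outward from the big gems,
# reusing the dependencies helper from earlier.
def collect_edges(big_gems)
  edges = []
  visited = {}
  queue = big_gems.dup
  until queue.empty?
    gem_name = queue.shift
    next if visited[gem_name]
    visited[gem_name] = true
    dependencies(gem_name).each do |dep|
      edges << [gem_name, dep]
      queue << dep
    end
  end
  edges
end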

It turns out there are a lot of AWS gems, so I decided to remove them from the graph and got a much nicer result:

The output of the graph gem that highlights gem dependencies
Full size image link if you want to zoom and pan

The graph, while moderately cluttered, shows a lot of information succinctly. For instance, you can see a galaxy of gems in the middle-left, with rails being the gravitational attractor, a clear keystone in the Ruby world.

Output of the gem graph with Rails at the center
The Rails galaxy

The node with the most arrows pointing into it is activesupport, so it really is an active support.

A close up view of activesupport in the output of the gem graph. activesupport has many arrows pointing into it.
14 arrows pointing into activesupport

Racc, despite appearing in my printouts as a small gem under many big gems, is a direct dependency of nokogiri alone.

A close up view of racc in the output of the gems graph
racc only has 1 edge attached to it

With this nice graph created, I followed up and made one final printout. This time, whenever I found a big gem that depended on a small gem, I printed out all the paths on the graph from the big gem to the small gem, that is, all the ways that the big gem relied on the small gem.

Here’s an example printout:

Big gem is in green (googleauth), small gems are in purple, and the black lines are all the paths from the big gem to the small gem.

I achieved this by making a directed graph data type and writing a depth-first search algorithm to find all the paths from one node to another. I chose to create my own data type because finding all paths on a graph isn’t implemented in any Ruby gem, from what I could tell. Here’s the algorithm, if you’re interested (`@graph` is a Hash of `String:Array` pairs, essentially an adjacency list):

Recursive depth-first search to find all paths from start to end
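
The screenshot above corresponds roughly to something like this (a sketch; the real method and variable names may differ):

# Returns every path from start to target in @graph, where @graph is an
# adjacency list: a Hash mapping a gem name to an Array of dependency names.
def all_paths(start, target, path = [])
  path += [start]
  return [path] if start == target
  return [] unless @graph.key?(start)

  @graph[start].flat_map do |neighbor|
    path.include?(neighbor) ? [] : all_paths(neighbor, target, path)
  end
end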

What’s Next

In summary, I found four ways to answer the question of whether or not big gems rely on small gems:

  1. direct dependency printout (using rubygems.org API)
  2. sub-dependency printout (using Bundler)
  3. graph (using graph gem)
  4. sub-dependency printout with paths (method 2 plus my own graph data type).

I’m happy with my work, and I’m glad I got to learn about file I/O and use graph theory. I’m still relatively new to Ruby, so offshoot projects like these are very didactic.

The question remains of what to do with the 24 technically insecure gems. One proposal is to do nothing, since everyone will eventually need to have MFA enabled, and account takeover is still an uncommon event despite being on the rise. 

Another option is to enforce MFA on these specific gems as a sort of blocklist, just to ensure the security of the top gems sooner. This would mean a small group of owners would have to enable MFA a few months earlier, so I could see this being a viable option. 

Either way, more discussion with my team is needed. Thanks for reading!

Kevin is an intern on the Ruby Dependency Security team at Shopify. He is in his 5th year of Engineering Physics at the University of British Columbia.


Wherever you are, your next journey starts here! If building systems from the ground up to solve real-world problems interests you, our Engineering blog has stories about other challenges we have encountered. Intrigued? Visit our Engineering career page to find out about our open positions and learn about Digital by Design.

Continue reading

How to Write Code Without Having to Read It

How to Write Code Without Having to Read It

Do we need to read code before editing it?

The idea isn’t as wild as it sounds. In order to safely fix a bug or update a feature, we may need to learn some things about the code. However, we’d prefer to learn only that information. Not only does extra reading waste time, it overcomplicates our mental model. As our model grows, we’re more likely to get confused and lose track of critical info.

But can we really get away with reading nothing? Spoiler: no. However, we can get closer by skipping over areas that we know the computer is checking, saving our focus for areas that are susceptible to human error. In doing so, we’ll learn how to identify and eliminate those danger areas, so the next person can get away with reading even less.

Let’s give it a try.

Find the Entrypoint

If we’re refactoring code, we already know where we need to edit. Otherwise, we’re changing a behavior that has side effects. In a backend context, these behaviors would usually be exposed APIs. On the frontend, this would usually be something that’s displayed on the screen. For the sake of example, we’ll imagine a mobile application using React Native and TypeScript, but this process generalizes to other contexts (as long as they have some concept of build or test errors; more on this later).

If our goal was to read a lot of code, we might search for all hits on RelevantFeatureName. But we don’t want to do that. Even if we weren’t trying to minimize reading, we’ll run into problems if the code we need to modify is called AlternateFeatureName, SubfeatureName, or LegacyFeatureNameNoOneRemembersAnymore.

Instead, we’ll look for something external: the user-visible strings (including accessibility labels—we did remember to add those, right?) on the screen we’re interested in. We search for various combinations of string fragments and quotation marks, or poke around with UI inspectors, until we find the matching string, either in the application code or in a language localization file. If we’re in a localization file, the localization key leads us to the application code that we’re interested in.

Tip
If we’re dealing with a regression, there’s an easier option: git bisect. When git bisect works, we really don’t need to read the code. In fact, we can skip most of the following steps. Because this is such a dramatic shortcut, always keep track of which bugs are regressions from previously working code.

Make the First Edit

If we’ve come in to make a simple copy edit, we’re done. If not, we’re looking for a component that ultimately gets populated by the server, disk, or user. We can no longer use exact strings, but we do have several read-minimizing strategies for zeroing in on the component:

  1. Where is this component on the screen, relative to our known piece of text?
  2. What type of standard component is it using? Is it a button? Text input? Text?
  3. Does it have some unusual style parameter that’s easy to search? Color? Corner radius? Shadow?
  4. Which button launches this UI? Does the button have searchable user-facing text?

These strategies all work regardless of naming conventions and code structure. Previous developers would have a hard time making our life harder without making the code nonfunctional. However, they may be able to make our life easier with better structure. 

For example, if we’re using strategy #1, well-abstracted code helps us quickly rule out large areas of the screen. If we’re looking for some text near the bottom of the screen, it’s much easier to hit the right Text item if we can leverage a grouping like this:

<SomeHeader />
<SomeContent />
<SomeFooter />

rather than being stuck searching through something like this:

// Header
<StaticImage />
<Text />
<Text />
<Button />
<Text />
...
// Content
...
// Footer
...

where we’ll have to step over many irrelevant hits.

Abstraction helps even if the previous developer chose wacky names for header, content, or footer, because we only care about the broad order of elements on the screen. We’re not really reading the code. We’re looking for objective cues like positioning. If we’re still unsure, we can comment out chunks of the screen, starting with larger or highly-abstracted components first, until the specific item we care about disappears.

Once we’ve found the exact component that needs to behave differently, we can make the breaking change right now, as if we’ve already finished updating the code. For example, if we’re making a new component that displays data newText, we add that parameter to its parent’s input arguments, breaking the build.

If we’re fixing a bug, we can also start by adjusting an argument list. For example, the condition “we shouldn’t be displaying x if y is present” could be represented with the tagged union {mode: 'x', x: XType} | {mode: 'y'; y: YType}, so it’s physically impossible to pass in x and y at the same time. This will also trigger some build errors.

Tagged Unions
Tagged unions go by a variety of different names and syntaxes depending on language. They’re most commonly referred to as discriminated unions, enums with associated values, or sum types.
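
As a concrete sketch of the x/y example above (component and type names here are invented for illustration, not taken from our codebase):

import React from 'react';
import { Text } from 'react-native';

type XType = { message: string };
type YType = { count: number };

// The component can receive x or y, never both; passing both is now a build error.
type Props = { mode: 'x'; x: XType } | { mode: 'y'; y: YType };

function FeatureView(props: Props) {
  // Every call site still passing the old argument shape fails to build,
  // which produces exactly the error trail we want to follow upward.
  return props.mode === 'x' ? (
    <Text>{props.x.message}</Text>
  ) : (
    <Text>{String(props.y.count)}</Text>
  );
}

export default FeatureView;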

Climb Up the Callstack

We now go up the callstack until the build errors go away. At each stage, we edit the caller as if we’ll get the right input, triggering the next round of build errors. Notice that we’re still not reading the code here—we’re reading the build errors. Unless a previous developer has done something that breaks the chain of build errors (for example, accepting any instead of a strict type), their choices don’t have any effect on us.

Once we get to the top of the chain, we adjust the business logic to grab newText or modify the conditional that was incorrectly sending x. At this point, we might be done. But often, our change could or should affect the behavior of other features that we may not have thought about. We need to sweep back down through the callstack to apply any remaining adjustments. 

On the downswing, previous developers’ choices start to matter. In the worst case, we’ll need to comb through the code ourselves, hoping that we catch all the related areas. But if the existing code is well structured, we’ll have contextual recommendations guiding us along the way: “because you changed this code, you might also like…”

Update Recommended Code

As we begin the downswing, our first line of defense is the linter. If we’ve used a deprecated library, or a pattern that creates non-obvious edge cases, the linter may be able to flag it for us. If previous developers forgot to update the linter, we’ll have to figure this out manually. Are other areas in the codebase calling the same library? Is this pattern discouraged in documentation?

After the linter, we may get additional build errors. Maybe we changed a function to return a new type, and now some other consumer of that output raises a type error. We can then update that other consumer’s logic as needed. If we added more cases to an enum, perhaps we get errors from other exhaustive switches that use the enum, reminding us that we may need to add handling for the new case. All this depends on how much the previous developers leaned on the type system. If they didn’t, we’ll have to find these related sites manually. One trick is to temporarily change the types we’re emitting, so all consumers of our output will error out, and we can check if they need updates.

Exhaustiveness
An exhaustive switch statement handles every possible enum case. Most environments don’t enforce exhaustiveness out of the box. For example, in TypeScript, we need to have strictNullChecks turned on, and ensure that the switch statement has a defined return type. Once exhaustiveness is enforced, we can remove default cases, so we’ll get notified (with build errors) whenever the enum changes, reminding us that we need to reassess this switch statement.
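
For example, a sketch of the return-type trick described above (the union values are borrowed from the flag example later in this post):

type Mode = 'hidden' | 'visible' | 'highlighted';

// With strictNullChecks on and an explicit, non-undefined return type,
// removing or forgetting a case makes this function fail to build because
// it could fall through without returning a number.
function opacityFor(mode: Mode): number {
  switch (mode) {
    case 'hidden':
      return 0;
    case 'visible':
      return 0.5;
    case 'highlighted':
      return 1;
  }
}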

Our final wave of recommendations comes from unit test failures. At this point, we may also run into UI and integration tests. These involve a lot more reading than we’d prefer; since these tests require heavy mocking, much of the text is just noise. Also, they often fail for unimportant reasons, like timing issues and incomplete mocks. On the other hand, unit tests sometimes get a bad rap for requiring code restructures, usually into more or smaller abstraction layers. At first glance, it can seem like they make the application code more complex. But we didn’t need to read the application code at all! For us, it’s best if previous developers optimized for simple, easy-to-interpret unit tests. If they didn’t, we’ll have to find these issues manually. One strategy is to check git blame on the lines we changed. Maybe the commit message, ticket, or pull request text will explain why the feature was previously written that way, and any regressions we might cause if we change it.

At no point in this process are comments useful to us. We may have passed some on the upswing, noting them down to address later. Any comments that are supposed to flag problems on the downswing are totally invisible—we aren’t guaranteed to find those areas unless they’re already flagged by an error or test failure. And whether we found comments on the upswing or through manual checking, they could be stale. We can’t know if they’re still valid without reading the code underneath them. If something is important enough to be protected with a comment, it should be protected with unit tests, build errors, or lint errors instead. That way it gets noticed regardless of how attentive future readers are, and it’s better protected against staleness. This approach also saves mental bandwidth when people are touching nearby code. Unlike standard comments, test assertions only pop when the code they’re explaining has changed. When they’re not needed, they stay out of the way.

Clean Up

Having mostly skipped the reading phase, we now have plenty of time to polish up our code. This is also an opportunity to revisit areas that gave us trouble on the downswing. If we had to read through any code manually, now’s the time to fix that for future (non)readers.

Update the Linter

If we need to enforce a standard practice, such as using a specific library or a shared pattern, codify it in the linter so future developers don’t have to find it themselves. This can trigger larger-scale refactors, so it may be worth spinning off into a separate changeset.

Lean on the Type System

Wherever practical, we turn primitive types (bools, numbers, and strings) into custom types, so future developers know which methods will give them valid outputs to feed into a given input. A primitive like timeInMilliseconds: number is more vulnerable to mistakes than time: MillisecondsType, which will raise a build error if it receives a value in SecondsType. When using enums, we enforce exhaustive switches, so a build error will appear any time a new case may need to be handled.
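
TypeScript doesn’t have nominal types built in, so one common way to get this behavior is a “branded” primitive. Here’s a sketch (the post doesn’t show how MillisecondsType was actually defined, so treat this as one possible approach):

type Milliseconds = number & { readonly __unit: 'ms' };
type Seconds = number & { readonly __unit: 's' };

const milliseconds = (n: number) => n as Milliseconds;
const seconds = (n: number) => n as Seconds;

function scheduleRefresh(delay: Milliseconds): void {
  console.log(`refreshing in ${delay}ms`);
}

scheduleRefresh(milliseconds(1500)); // ok
// scheduleRefresh(seconds(2));      // build error: Seconds is not assignable to Milliseconds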

We also check methods for any non-independent arguments:

  • Argument A must always be null if Argument B is non-null, and vice versa (for example, error/response).
  • If Argument A is passed in, Argument B must also be passed in (for example, eventId/eventTimestamp).
  • If Flag A is off, Flag B can’t possibly be on (for example, visible/highlighted).

If these arguments are kept separate, future developers will need to think about whether they’re passing in a valid combination of arguments. Instead, we combine them, so the type system will only allow valid combinations:

  • If one argument must be null when the other is non-null, combine them into a tagged union: {type: 'failure'; error: ErrorType} | {type: 'success'; response: ResponseType}.
  • If two arguments must be passed in together, nest them into a single object: event: {id: IDType; timestamp: TimestampType}.
  • If two flags don’t vary independently, combine them into a single enum: 'hidden'|'visible'|'highlighted'.

Optimize for Simple Unit Tests

When testing, avoid entanglement with UI, disk or database access, the network, async code, current date and time, or shared state. All of these factors produce or consume side effects, clogging up the tests with setup and teardown. Not only does this spike the rate of false positives, it forces future developers to learn lots of context in order to interpret a real failure.

Instead, we want to structure our code so that we can write simple tests. As we saw, people can often skip reading our application code. When test failures appear, they have to interact with them. If they can understand the failure quickly, they’re more likely to pay attention to it, rather than adjusting the failing assertion and moving on. If a test is starting to get complicated, go back to the application code and break it into smaller pieces. Move any what code (code that decides which side effects should happen) into pure functions, separate from the how code (code that actually performs the side effects). Once we’re done, the how code won’t contain any nontrivial logic, and the what code can be tested—and therefore documented—without complex mocks.

Trivial vs. Nontrivial Logic
Trivial logic would be something like if (shouldShow) show(). Something like if (newUser) show() is nontrivial (business) logic, because it’s specific to our application or feature. We can’t be sure it’s correct unless we already know the expected behavior.
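
A small sketch of that separation (the names are invented for illustration):

// "What" code: pure business logic, unit-testable with no mocks.
export function shouldShowWelcome(user: { isNew: boolean; dismissedWelcome: boolean }): boolean {
  return user.isNew && !user.dismissedWelcome;
}

// "How" code: performs the side effect and contains no logic worth testing on its own.
export function maybeShowWelcome(
  user: { isNew: boolean; dismissedWelcome: boolean },
  show: (message: string) => void
): void {
  if (shouldShowWelcome(user)) {
    show('Welcome!');
  }
}

A test for shouldShowWelcome is a one-line assertion, and that assertion doubles as documentation of the rule.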

Whenever we feel an urge to write a comment, that’s a signal to add more tests. Split the logic out into its own unit tested function so the “comment” will appear automatically, regardless of how carefully the next developer is reading our code.

We can also add UI and integration tests, if desired. However, be cautious of the impulse to replace unit tests with other kinds of tests. That usually means our code requires too much reading. If we can’t figure out a way to run our code without lengthy setup or mocks, humans will need to do a similar amount of mental setup to run our code in their heads. Rather than avoiding unit tests, we need to chunk our code into smaller pieces until the unit tests become easy.

Confirm

Once we’ve finished polishing our code, we manually test it for any issues. This may seem late, but we’ve converted many runtime bugs into lint, build, and test errors. Surprisingly often, we’ll find that we’ve already handled all the edge cases, even if we’re running the code for the first time.

If not, we can do a couple more passes to address the lingering issues… adjusting the code for better “unread”-ability as we go.

Tip
Sometimes, our end goal really is to read the code. For example, we might be reviewing someone else’s code, verifying the current behavior, or ruling out bugs. We can still pose our questions as writes:

  • Could a developer have done this accidentally, or does the linter block it when we try?
  • Is it possible to pass this bad combination of arguments, or would that be rejected at build time?
  • If we hardcode this value, which features (represented by unit tests) would stop working?

JM Neri is a senior mobile developer on the Shop Pay team, working out of Colorado. When not busy writing unit tests or adapting components for larger text sizes, JM is usually playing in or planning for a TTRPG campaign.


Wherever you are, your next journey starts here! If building systems from the ground up to solve real-world problems interests you, our Engineering blog has stories about other challenges we have encountered. Intrigued? Visit our Engineering career page to find out about our open positions and learn about Digital by Design.

Continue reading

A Flexible Framework for Effective Pair Programming

A Flexible Framework for Effective Pair Programming

Pair programming is one of the most important tools we use while mentoring early talent in the Dev Degree program. It’s an agile software development technique where two people work together, either to share context, solve a problem, or learn from one another. Pairing builds technical and communication skills, encourages curiosity and creative problem-solving, and brings people closer together as teammates.

In my role as a Technical Educator, I’m focused on setting new interns joining the Dev Degree program up for success in their first 8 months at Shopify. Because pair programming is a method we use so frequently in onboarding, I saw an opportunity to streamline the process to make it more approachable for people who might not have experienced it before. I developed this framework during a live workshop I hosted at RenderATL. I hope it helps you structure your next pair programming session!

Pair Programming in Dev Degree

“Pair programming was my favorite weekly activity while on the Training Path. When I first joined my team, I was constantly pairing with my teammates. My experiences on the Training Path made these new situations manageable and incredibly fun! It was also a great way to socialize and work technically with peers outside of my school cohort that I wouldn't talk to often. I've made some good friends just working on a little project for the weekly module.” - Mikail Karimi, Dev Degree Intern

In the Dev Degree program, we use mentorship to build up the interns’ technical and soft skills. As interns are usually early in their career journey, having recently graduated from high school or switched careers, their needs differ from someone with more career experience. Mentorship from experienced developers is crucial to prepare interns for their first development team placement. It shortens the time it takes to start making a positive impact with their work by developing their technical and soft skills like problem solving and communication. This is especially important now that Shopify is digital by design, as learning and working remotely is a completely new experience to many interns. 

All first year Dev Degree interns at Shopify go through an eight-month period known as the Training Path. During this period, we deliver a set of courses designed to teach them all about Shopify-specific development. They’re mentored by Student Success Specialists, who coach them on building their soft skills like communication, and Technical Instructors, who focus on the technical aspects of the training. Pair programming with a peer or mentor is a great way to support both of these areas of development.

Each week, we allocate two to three hours for interns to pair program with each other on a problem. We don’t expect them to solve the problem completely, but they should use the concepts they learned from the week to hone their technical craft.

We also set up bi-weekly 30-minute pair programming sessions with each intern. The purpose of these sessions is to provide dedicated one-on-one time to learn and work directly with an instructor. They can share what they are having trouble with, and we help them work through it.

“When I’m switching teams and disciplines, pair programming with my new team is extremely helpful to see what resources people use to debug, the internal tools they use to find information and how they approach a problem. On my current placement, I got better at resolving problems independently when I saw how my mentor handled a new problem neither of us had seen.” Sanaa Syed, Dev Degree Intern

As we scale up the program, there are some important questions I keep returning to:

  • How do we track their progress most effectively?
  • How do we know what they want to pair on each day?
  • How can we provide a safe space?
  • What are some best practices for communicating?

I started working on a framework to help solve these issues. I know I’m not the only one on my team asking themselves these questions. Along the way, an opportunity arose to do a workshop at RenderATL. At Shopify, we are encouraged to learn as part of our professional development. Wanting to level up my public speaking skills, I decided to talk about mentorship through a pair programming lens. As the framework was nearly completed, I decided to crowdsource and finish the framework together with the RenderATL attendees.

Crowdsourcing a Framework for All

On June 1st, 2022, Shopify hosted free all-day workshops at RenderATL called Heavy Hitting React at Shopify Engineering. It contained five different workshops, covering a range of topics from specific technical skills like React Native to broader skills like communication. We received a lot of positive feedback, met many amazing folks, and made sure those who attended gained new knowledge or skills they could walk away with.

For my workshop, Let’s Pair Program a Framework Together, we pair programmed a pair programming framework. The goal was to crowdsource and finish the pair programming framework I was working on based on the questions I mentioned above. We had over 30 attendees, and the session was structured to be interactive. I walked the audience through the framework and got their suggestions on the unfinished parts of the framework. At the end, the attendees paired up and used the framework to work together and draw a picture they both wanted to draw.

Before the workshop, I sent a survey internally asking developers a few questions about pair programming. Here are the results:

  • 62.5% had over 10 years of programming experience
  • 78.1% had pair programmed before joining Shopify
  • 50% of respondents pair once or twice a week at Shopify

When asked “What is one important trait to have when pair programming?”, this is what Shopify developers had to say:

Communication

  • Expressing thought processes (what you’re doing, why you’re making this change, etc.)
  • Sharing context to help others get a thorough understanding
  • Use of visual aids to assist with explanation

Empathy

  • Being aware of energy levels
  • Not being judgemental to others

Open-mindedness

  • Curious to learn
  • Willingness to take feedback and improve
  • Don’t adhere to just one’s opinion

Patience

  • Providing time to allow your partner to think and formulate opinions
  • Encouraging repetition of steps and instructions to invite questions and learning by doing

Now, let’s walk through the crowdsourced framework we finished at RenderATL.

Note: For those who attended the workshop, the framework below is the same framework that you walked away with, but with more details and resources.

The Framework

A woman with her back to the viewer writing on a whiteboard
Pair programming can be used for more than just writing code. Try pairing on other problems and using tools like a whiteboard to work through ideas together.

This framework covers everything you need to run a successful pair programming session, including: roles, structure, agenda, environment, and communication. You can pick and choose within each section to design your session based on your needs.

1a. Pair Programming Styles

There are many different ways to run a pair programming session. Here are the ones we found to be the most useful, and when you may want to use each depending on your preferences and goals.

Driver and Navigator

Think about this style like a long road trip. One person is focused on driving to get from point A to B, while the other person provides directions, looks for future pit stops for breaks, and observes the surroundings. As driving can be taxing, it’s a good idea to switch roles frequently.

The driver is the person leading the session and typing on the keyboard. As they are typing, they’re explaining their thought process. The navigator, also known as the observer, is the person observing, reviewing code that’s being written, and making suggestions along the way. For example, suggesting refactoring code and thinking about potential edge cases.

If you’re an experienced person pairing with an intern or junior developer, I recommend using this style after you’ve paired together for a few sessions. They’re likely still gaining context and getting comfortable with the codebase in the first few sessions.

Tour Guide

This style is like giving someone a personal tour of the city. The experienced person drives most of the session, hence the title tour guide, while the partner observes and asks questions along the way.

I suggest using this style when working with someone new on your team. It’s a great way to give them a personal tour to how your team’s application works and share context along the way. You can also flip it, where the least experienced person is the tour guide. I like to do this with the Dev Degree interns who are a bit further into their training when I pair with them. I find it helps bring out their communication skills once they’ve started to gain some confidence in their technical abilities.

Unstructured

The unstructured style is more of a freestyle way to work on something together, like learning a new language or concept. The benefit of the unstructured style is the team building and creative solutions that can come from two people hacking away and figuring things out. This is useful when a junior developer or intern pairs with someone at their level. The only downside is that without a mentor overseeing them, there’s a risk of missteps or bad habits going unchecked. This can be solved after the session by having them share their findings with a mentor for discussion.

We allocated time for the interns to pair together. This is the style the interns typically go with, figuring things out using the concepts they learned.

1b. Activities

When people think of pair programming, they often strictly think of coding. But pair programming can be effective for a range of activities. Here are some we suggest.

Code Reviews

I remember when I first started reviewing code, I wasn’t sure what exactly I was meant to be looking for. Having an experienced mentor support with code reviews helps early talent pick up context and catch things that they might not otherwise know to look for. Interns also bring a fresh perspective, which can benefit mentors as well by prompting them to unpack why they might make certain decisions.

Technical Design and Documentation

This activity means working together to design a new feature or go through team documents. If you put yourself in a junior developer’s shoes, what would that look like? It could be a whiteboarding session mapping out logic for the new feature, or improving the team’s onboarding documentation. This could be an incredibly impactful session for them: you’ll be helping broaden their technical depth, helping future teammates onboard faster, and sharing your expertise along the way.

Writing Test Cases Only

Imagine you’re a junior developer working on your very first task. You have finished writing the functionality, but haven’t written any tests for it. You tested it manually and know it works. One thing you’re trying to figure out is how to write a test for it now. This is where a pair programming session with an experienced developer is beneficial. You work together to extend testing coverage and learn team-specific styles when writing tests.

Onboarding

Pairing is a great way to help onboard someone new onto your team. It helps the new person joining your team ramp up quicker with your wealth of knowledge. Together you explore the codebase, documentation, and team-specific rituals.

Hunting Bugs

Put yourselves in your users’ shoes and go on a bug hunt together. As you test functionalities on production, you'll gain context on the product and reduce the number of bugs on your application. A win-win!

2. Set an Agenda

A photo of a woman writing in a notebook on a desk with a coffee and croissant

Setting an agenda beforehand is key to making sure your session is successful.

Before the session, work with your partner to align on the style, activity, and goals for your pair programming session. This way you can hit the ground running while pairing and work together to achieve a common goal. Here are some questions you can use to set your agenda: 

  • What do you want to pair on today?
  • How do you want the session to go? You drive? I drive? Or both?
  • Where should we be by the end of the session?
  • Is there a specific skill you want to work on?
  • What’s blocking you?

3. Set the Rules of Engagement

A classroom with two desks and chairs in the foreground 
Think of your pair programming session like a classroom. How would you make sure it’s a great environment to learn?

"After years of frequent pair programming, my teammates and I have established patterns that give the impression that we are always right next to each other, which makes context sharing and learning from peers much simpler." -Olakitan Bello, Developer

Now that your agenda is set, it’s time to think about the environment you want to have during the session. Imagine yourself as a teacher. If this was a classroom, how would you provide the best learning environment?

Be Inclusive

Everyone should feel welcomed and invited to collaborate with others. One way we can set the tone is to establish that “There are no wrong answers here” or “There are no dumb questions.” If you’re a senior colleague, saying “I don’t know” to your partner is a very powerful thing. It shows that you’re a human too! Keep accessibility in mind as well. There are tools and styles available to tailor pair programming sessions to the needs of you and your partner. For example, there are alternatives to verbal communication, like using a digital whiteboard or even sending messages over a communication platform. Invite people to be open about how they work best and support each other to create the right environment.

Remember That Silence Isn’t Always Golden

If it gets very quiet as you’re pairing together, it’s usually not a good sign. When you pair program with someone, communication is very important to both parties. Without it, it’s hard for one person to perceive the other person’s thoughts and feelings. Make a habit of explaining your thought process out loud as you work. If you need a moment to gather your thoughts, simply let your partner know instead of going silent on them without explanation.

Respect Each Other

Treat your partner the way you want to be treated, and value all opinions. Someone else’s opinion can help lead to an even greater solution. Everyone should be able to contribute and express their opinions.

Be Empathic Not Apathetic

If you’re pair programming remotely, displaying empathy goes a long way. As you’re pair programming with them, read the room. Do they feel flustered with you driving too fast? Are you aware of their emotional needs at the moment?

As you’re working together, listen attentively and provide them space to contribute and formulate opinions.

Mistakes Are Learning Opportunities

If you made a mistake while pair programming, don’t be embarrassed about it. Mistakes happen, and are actually opportunities to learn. If you notice your partner make a mistake, point it out politely—no need to make a big deal of it.

4. Communicate Well

A room with five happy people talking in a meeting
Communication is one of the most important aspects of pair programming.

Pair programming is all about communication. For two people to work together to build something, both need to be on the same page. Remote work can introduce unique communication challenges, since you don’t have the benefit of things like body language or gestures that come with being in the same physical room. Fortunately, there’s great tooling available, like Tuple, to solve these challenges and even enhance the pair programming experience. It’s a macOS-only application that allows people to pair program with each other remotely. Users can share their screen, and either person can take control to drive. The best part is that it’s a seamless experience without any additional UI taking up space on your screen.

During your session, use these tips to make sure you’re communicating with intention. 

Use Open-ended Questions

Open-ended questions lead to longer dialogues and provide a moment for someone to think critically. Even if they don’t know the answer, they’ll learn something new to take away from the session. Closed questions usually only get a “yes” or a “no.” Let’s say we’re working together on building a React component that groups buttons. Which one sounds more inviting for a discussion:

  • Is ButtonGroup a good name for the component? (Close-ended question)
  • What do you think of the name ButtonGroup? (Open-ended question)

Other examples of open-ended questions:

  • What are some approaches you took to solving this issue?
  • Before we try this approach, what do you think will happen?
  • What do you think this block of code is doing?

Give Positive Affirmations

Encouragement goes a long way, especially when folks are finding their footing early in their career. After all, knowing what’s gone right can be just as important as knowing what’s gone wrong. Throughout the session, pause and celebrate progress by noting when you see your partner do something well.

For example, you and your partner are building a new endpoint that’s part of a new feature your team is implementing. Instead of waiting until the feature is released, celebrate the small win.

Here are a few example messages you can give:

  • It looks like we marked everything off the agenda. Awesome session today.
  • Great work on catching the error. This is why I love pair programming because we work together as a team.
  • Huge win today! The PR we worked on together is going to help not only our team, but others as well.

Communication Pitfalls to Avoid

No matter how it was intended, a rude or condescending comment made in passing can throw off the vibe of a pair programming session and quickly erode trust between partners. Remember that programming in front of someone can be a vulnerable experience, especially for someone just starting out. While it might seem obvious, we all need a reminder sometimes to be mindful of what we say. Here are some things to watch out for.

Passive-Aggressive (Or Just-Plain-Aggressive) Comments

Passive-aggressive behavior is when someone expresses anger or negative feelings in an indirect way. If you’re feeling frustrated during a session, avoid slipping into this behavior. Instead, communicate your feelings directly in a constructive way. When negative feelings are expressed in a hostile or outwardly rude way, this is aggressive behavior and should be avoided completely. 

Passive-aggressive behavior examples:

  • Giving your partner the silent treatment
  • Eye-rolling or sighing in frustration when your partner makes a mistake
  • Sarcastic comments like “Could you possibly code any slower?” 
  • Subtle digs like “Well…that’s an interesting idea.”

Aggressive behavior examples:

  • “Typical intern mistake, rookie.”
  • “This is NOT how someone at your level should be working.”
  • “I thought you knew how to do this.”

Absolute Words Like Always and Never

Absolute words imply you are 100 percent certain, and depending on how they’re used in conversation, they can sound condescending. Programming is also a world full of nuance, so overgeneralizing solutions as right or wrong is often misleading. Instead, use these scenarios as a teaching opportunity. If something is usually true, explain the scenarios where there might be exceptions. If a solution rarely works, explain when it might. Edge cases can open some really interesting and valuable conversations.

For example:

  • “You never write perfect code”
  • “I always review code”
  • “That would never work”

Use alternative terms instead:

  • For always, you can use usually
  • For never, you can use rarely

“When you join Shopify, I think it's overwhelming given the amount of resources to explore even for experienced people. Pair programming is the best way to gain context and learn. In this digital by design world, pair programming really helped me to connect with team members and gain context and learn how things work here in Shopify which helped me with faster onboarding.” -Nikita Acharya, Developer

I want to thank everyone at RenderATL who helped me finish this pair programming framework. If you’re early on in your career as a developer, pair programming is a great way to get to know your teammates and build your skills. And if you’re an experienced developer, I hope you’ll consider mentoring newer developers using pair programming. Either way, this framework should give you a starting point to give it a try.

In the pilot we’ve run so far with this framework, we’ve received positive feedback from our interns about how it allowed them to achieve their learning goals in a flexible format. We’re still experimenting and iterating with it, so we’d love to hear your feedback if you give it a try! Happy pairing!

Raymond Chung is helping to build a new generation of software developers through Shopify’s Dev Degree program. As a Technical Educator, his passion for computer science and education allows him to create bite-sized content that engages interns throughout their day-to-day. When he is not teaching, you’ll find Raymond exploring for the best bubble tea shop. You can follow Raymond on Twitter, GitHub, or LinkedIn.


Wherever you are, your next journey starts here! If building systems from the ground up to solve real-world problems interests you, our Engineering blog has stories about other challenges we have encountered. Intrigued? Visit our Engineering career page to find out about our open positions and learn about Digital by Design.

Continue reading

Lessons From Building Android Widgets

Lessons From Building Android Widgets

By Matt Bowen, James Lockhart, Cecilia Hunka, and Carlos Pereira

When the new widget announcement was made for iOS 14, our iOS team went right to work designing an experience to leverage the new platform. However, widgets aren’t new to Android and have been around for over a decade. Shopify cares deeply about its mobile experience and for as long as we’ve had the Shopify mobile app, both our Android and iOS teams ship every feature one-to-one in unison. With the spotlight now on iOS 14, this was a perfect time to revisit our offering on Android.

Since our offering was the same across both platforms, we knew, just like our iOS counterparts at the time, that merchants were using our widgets but needed more.

Why Widgets are Important to Shopify

Our widgets mainly focus on analytics that help merchants understand how they’re doing and gain insights to make better decisions quickly about their business. Monitoring metrics is a daily activity for a lot of our merchants, and on mobile, we have the opportunity to give merchants a faster way to access this data through widgets. They provide merchants a unique avenue to quickly get a pulse on their shops that isn’t available on the web.

Add Insights widget

After gathering feedback and continuously looking for opportunities to enhance our widget capabilities, we’re at our third iteration, and we’ll share with you the challenges we faced and how we solved them.

Why We Didn’t Use React Native

A couple of years ago, Shopify decided to go full on React Native: new development is done in React Native, and we’re also migrating some apps to the technology. This includes the flagship admin app, which is the companion app to the widgets.

Then why not write the widgets in React Native?

After doing some initial investigation, we quickly hit some roadblocks: RemoteViews are the only way to create widgets, and there’s currently no official support for RemoteViews in the React Native community. This felt very akin to a square peg in a round hole. Our iOS counterparts also ran into issues using React Native for their widgets, and we were running down the same path. Shopify believes in using the right tool for the job, and we believe that native development was the right call in this case.

Building the Widgets

When building out our architecture for widgets, we wanted to create a consistent experience on both Android and iOS while preserving platform idioms where it made sense. In the sections below, we want to give you a view of our experiences building widgets, pointing out some of the more difficult challenges we faced. Our aim is to shed some light on these less-used surfaces, hopefully give some inspiration, and save you time when it comes to implementing widgets in your applications.

Fetching Data

Some types of widgets have data that change less frequently (for example, reminders) and some that can be forecasted for the entire day (for example, calendar and weather). In our case, the merchants need up-to-date metrics about their business, so we need to show data as fresh as possible. Delays in data can cause confusion, or even worse, delay information that could change an action. Say you follow the stock market: you expect the stock app and widget data to be as up to date as possible. If the data is multiple hours stale, you may have missed something important! For our widgets to be valuable, we need information to be fresh while considering network usage.

Fetching Data in the App

Widgets can be kept up to date with relevant and timely information by using data available locally or fetching it from a server. The server fetch can be initiated by the widget itself or by the host app. In our case, since the app doesn’t need the same information the widget does, we decided it made more sense to fetch it from the widget.

One benefit of how widgets are managed in the Android ecosystem, compared to iOS, is the flexibility. On iOS you have limited communication between the app and widget, whereas on Android there aren’t the same restrictions. This becomes clear when we think about how we configure a widget. The widget configuration screen has access to all of the libraries and classes that our main app does. It’s no different than any other screen in the app. This is mostly true with the widget as well. We can access the resources contained in our main application, so we don’t need to duplicate any code. The only restrictions in a widget come with building views, which we’ll explore later.

When we save our configuration, we use shared preferences to persist data between the configuration screen and the widget. When a widget update is triggered, the shared preferences data for a given widget is used to build our request, and the results are displayed within the widget. We can read that data from anywhere in our app, allowing us to reuse this data in other parts of our app if desired.
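
As a rough sketch of that handoff (class, key, and field names below are placeholders, not our actual implementation):

import android.content.Context

// Written by the configuration screen, read back when the widget updates.
class WidgetConfigStore(context: Context) {
    private val prefs = context.getSharedPreferences("insights_widgets", Context.MODE_PRIVATE)

    fun save(widgetId: Int, shopId: String, metricIds: List<String>) {
        prefs.edit()
            .putString("shop_$widgetId", shopId)
            .putString("metrics_$widgetId", metricIds.joinToString(","))
            .apply()
    }

    fun shopId(widgetId: Int): String? = prefs.getString("shop_$widgetId", null)

    fun metricIds(widgetId: Int): List<String> =
        prefs.getString("metrics_$widgetId", null)?.split(",").orEmpty()
}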

Making the Widgets Antifragile

The widget architecture is built in a way that updates are mindful of battery usage, where updates are controlled by the system. In the same way, our widgets must also be mindful of saving bandwidth when fetching data over a network. While developing our second iteration, we came across a peculiar problem that was exacerbated by our specific use case. Since we need data to be fresh, we always pull new data from our backend on every update. Each update is approximately 15 minutes apart to avoid having our widgets stop updating. What we found is that widgets call their update method onUpdate(), more than once in an update cycle. In widgets like calendar, these extra calls come without much extra cost as the data is stored locally. However, in our app, this was triggering two to five extra network calls for the same widget with the same data in quick succession.

In order to correct the unnecessary roundtrips, we built a simple short-lived cache (sketched in code after the flow diagram below):

  1. The system asks our widget to provide new data from Reportify (Shopify’s data service)
  2. We first look into the local cache using the widgetID provided by the system.
  3. If there’s data, and that data was set less than one minute ago, we return it and avoid making a network request. We also take configuration such as locale into account, so a language change still forces a refresh.
  4. Otherwise, we fetch the data as normal and store it in the cache with the timestamp.

A flow diagram highlighting the steps of the cache
The simple short-lived cache flow
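
In code, the check might look roughly like this (a sketch under the assumptions above; the real cache, types, and TTL handling are more involved):

import java.util.Locale

data class WidgetMetrics(val values: Map<String, Double>) // placeholder payload type

private data class CacheEntry(val data: WidgetMetrics, val fetchedAt: Long, val locale: Locale)

object WidgetCache {
    private const val TTL_MS = 60_000L
    private val entries = mutableMapOf<Int, CacheEntry>()

    fun getOrFetch(widgetId: Int, locale: Locale, fetch: () -> WidgetMetrics): WidgetMetrics {
        val cached = entries[widgetId]
        // Reuse the cached data only if it's recent and was fetched for the same locale.
        if (cached != null &&
            cached.locale == locale &&
            System.currentTimeMillis() - cached.fetchedAt < TTL_MS
        ) {
            return cached.data
        }

        return fetch().also {
            entries[widgetId] = CacheEntry(it, System.currentTimeMillis(), locale)
        }
    }
}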

With this solution, we reduced unused network calls and system load and avoided collecting incorrect analytics.

Implementing Decoder Strategy with Dynamic Selections

We follow a similar approach as we have on iOS. We create a dynamic set of queries based on what the merchant has configured.

For each metric we have a corresponding definition implementation. This approach allows each metric the ability to have complete flexibility around what data it needs, and how it decodes the data from the response.

When Android asks us to update our widgets, we pull the merchant’s selection from our configuration object. Since each of the metric IDs has a definition, we map over them to create a dynamic set of queries.

We include an extension on our response object that binds the definitions to a decoder. Our service sends back an array of the response data corresponding to the queries made. We map over the original definitions, decoding each chunk to the expected return type.
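
Pieced together, the idea looks something like this (all type and function names here are illustrative stand-ins for our real definitions):

data class DateRange(val start: String, val end: String)
data class ReportifyQuery(val metricId: String, val range: DateRange)
data class MetricValue(val metricId: String, val value: Double)

// One implementation per metric: it knows its own query and its own decoding.
interface MetricDefinition {
    val id: String
    fun query(range: DateRange): ReportifyQuery
    fun decode(chunk: Map<String, Any?>): MetricValue
}

// Map the merchant's selected metric IDs to a dynamic set of queries...
fun buildQueries(
    selectedIds: List<String>,
    range: DateRange,
    registry: Map<String, MetricDefinition>
): List<ReportifyQuery> = selectedIds.mapNotNull(registry::get).map { it.query(range) }

// ...then decode each response chunk with the definition that produced its query.
fun decodeAll(
    definitions: List<MetricDefinition>,
    chunks: List<Map<String, Any?>>
): List<MetricValue> = definitions.zip(chunks) { definition, chunk -> definition.decode(chunk) }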

Building the UI

Similar to iOS, we support three widget sizes for versions prior to Android 12 and follow the same rules for cell layout, except for the small widget. The small widget on Android supports a single metric (compared to the two on iOS) and the smallest widget size on Android is a 2x1 grid. We quickly found that only a single metric would fit in this space, so this design differs slightly between the platforms.

Unlike iOS with its SwiftUI previews, we were limited to XML previews and running the widget on an emulator or device. We’re also building widgets dynamically, so even XML previews were relatively useless if we wanted to see an entire widget preview. Widgets are currently on the 2022 Jetpack Compose roadmap, so this is likely to change soon with Jetpack composable previews.

With the addition of dynamic layouts in Android 12, we created five additional sizes to support each size in between the original three. These new sizes are unique to Android. This also led to using grid sizes as part of our naming convention as you can see in our WidgetLayout enum below.

For the structure of our widget, we used an enum that acts as a blueprint to map the appropriate layout file to an area of our widget. This is particularly useful when we want to add a new widget because we simply need to add a new enum configuration.
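
The enum itself appears as a screenshot in the original post; a trimmed-down sketch of the blueprint idea might look like this (names, sizes, and layout resources are placeholders):

import androidx.annotation.LayoutRes

// Grid size in the name, layout resource as the blueprint for that size.
enum class WidgetLayout(val columns: Int, val rows: Int, @LayoutRes val layoutId: Int) {
    SIZE_2X1(2, 1, R.layout.widget_2x1),
    SIZE_2X2(2, 2, R.layout.widget_2x2),
    SIZE_4X2(4, 2, R.layout.widget_4x2),
    SIZE_4X4(4, 4, R.layout.widget_4x4)
}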

To build the widgets dynamically, we read our configuration from shared preferences and provide that information to the RemoteViews API.

If you’re familiar with the RemoteViews API, you may notice the updateView() method, which is not a default RemoteViews method. We created this extension method as a result of an issue we ran into while building our widget layout in this dynamic manner. When a widget updates, the new remote views get appended to the existing ones. As you can probably guess, the widget didn’t look so great. Even worse, more remote views get appended on each subsequent update. We found that combining the two RemoteViews API methods removeAllViews() and addView() solved this problem.
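The extension described above boils down to something like this (the container ID is a placeholder):

import android.widget.RemoteViews

// Clear any previously-added children before adding the freshly built layout,
// so remote views don't accumulate across widget updates.
fun RemoteViews.updateView(containerId: Int, child: RemoteViews) {
    removeAllViews(containerId)
    addView(containerId, child)
}

// Usage: parentViews.updateView(R.id.widget_container, cellViews)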

Once we build our remote views, we then pass the parent remote view to the AppWidgetProvider updateAppWidget() method to display the desired layout.

It’s worth noting that we attempted to use partiallyUpdateAppWidget() to stop our remote views from appending to each other, but encountered the same issue.

Using Dynamic Dates

One important piece of information on our widget is the last updated timestamp. It helps remove confusion by letting merchants quickly see how fresh the data they’re looking at is. If the data were quite stale (say you went to the cottage for the weekend and missed a few updates) and there wasn’t a displayed timestamp, you would assume the data you’re looking at is up to the second. This can cause unnecessary confusion for our merchants. The solution here was to ensure there’s some communication to our merchants about when the last update was made.

In our previous design, we only had small widgets, and they were able to display only one metric. This information resulted in a long piece of text that, on smaller devices, would sometimes wrap and show over two lines. That was fine when space was abundant in our older design, but not in our new data-rich designs. We explored how we could best work with timestamps on widgets, and the most promising solution was to use relative time. Instead of a static value such as “as of 3:30pm” like our previous iteration, we would have a dynamic date that would look like “1 min, 3 sec ago.”

One thing to remember is that even though the widget is visible, we have a limited number of updates we can trigger; otherwise, it would consume a lot of unnecessary resources on the device. We knew we couldn’t keep triggering updates on the widget as often as we wanted. Android has a strategy for solving this with TextClock, but TextClock has no support for relative time, so it wasn’t useful for our use case. We also explored using alarms, but potentially updating every minute would consume too much battery.

One big takeaway we had from these explorations was to always test your widgets under different conditions, especially low battery or a poor network. These surfaces are much more restrictive than general applications, and the OS is much more likely to ignore updates.

We eventually decided that we wouldn’t use relative time and kept the widget’s refresh time as a timestamp. This way we have full control over things like date formatting and styling.

Adding Configuration

Our new widgets have a great deal of configuration options, allowing our merchants to choose exactly what they care about. For each widget size, the merchant can select the store, a certain number of metrics and a date range. This is the only part of the widget that doesn’t use RemoteViews, so there aren’t any restrictions on what type of View you may want to use. We share information between the configuration and the widget via shared preferences.

Insights widget configuration
Insights widget configuration

Working with Charts and Images

Android widgets are limited to RemoteViews as their building blocks and are very restrictive in terms of the view types supported. If you need to support anything outside of basic text and images, you need to be a bit creative.

Our widgets support both a sparkline and a spark bar chart built using the MPAndroidChart library. We already have these charts configured and styled in our main application, so the reuse here was perfect, except that we can’t use any custom Views in our widgets. Luckily, this library creates its charts by drawing to a canvas, so we can simply export the charts as a bitmap into an image view.

Once we were able to measure the widget, we constructed a chart of the required size, created a bitmap, and set it on the widget’s ImageView through RemoteViews. One small thing to remember with this approach is that if you want transparent backgrounds, you’ll have to use ARGB_8888 as the Bitmap Config. This simple bitmap-to-ImageView approach allowed us to handle any custom drawing we needed to do.
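A hedged sketch of the export, using MPAndroidChart’s LineChart as an example; the view id and measuring details are illustrative:

import android.graphics.Bitmap
import android.graphics.Canvas
import android.view.View
import android.widget.RemoteViews
import com.github.mikephil.charting.charts.LineChart

// Hedged sketch of the chart-to-bitmap export; R.id.widget_chart and the measured
// pixel sizes are illustrative. ARGB_8888 keeps transparent backgrounds intact.
fun renderChartIntoWidget(chart: LineChart, remoteViews: RemoteViews, widthPx: Int, heightPx: Int) {
    chart.measure(
        View.MeasureSpec.makeMeasureSpec(widthPx, View.MeasureSpec.EXACTLY),
        View.MeasureSpec.makeMeasureSpec(heightPx, View.MeasureSpec.EXACTLY)
    )
    chart.layout(0, 0, widthPx, heightPx)

    val bitmap = Bitmap.createBitmap(widthPx, heightPx, Bitmap.Config.ARGB_8888)
    chart.draw(Canvas(bitmap)) // MPAndroidChart renders by drawing to a canvas
    remoteViews.setImageViewBitmap(R.id.widget_chart, bitmap)
}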

Eliminating Flickering

One minor but annoying issue we encountered throughout the project is what we like to call “widget flickering.” Flickering is a side effect of the widget updating its data. Between updates, Android uses the initialLayout from the widget’s configuration as a placeholder while the widget fetches its data and builds its new RemoteViews. We found that it wasn’t possible to eliminate this behavior, so we implemented a couple of strategies to reduce the frequency and duration of the flicker.

The first strategy is used when a merchant first places a widget on the home screen. This is where we can reduce the frequency of flickering. In an earlier section “Making the Widgets Antifragile,” we shared our short-lived cache. The cache comes into play because the OS will trigger multiple updates for a widget as soon as it’s placed on the home screen. We’d sometimes see a quick three or four flickers, caused by updates of the widget. After the widget gets its data for the first time, we prevent any additional updates from happening for the first 60 seconds, reducing the frequency of flickering.

The second strategy reduces the duration of a flicker (or how long the initialLayout is displayed). We store the widgets configuration as part of shared preferences each time it’s updated. We always have a snapshot of what widget information is currently displayed. When the onUpdate() method is called, we invoke a renderEarlyFromCache() method as early as possible. The purpose of this method is to build the widget via shared preferences. We provide this cached widget as a placeholder until the new data has arrived.

Gathering Analytics

Largest Insights widget in light mode
Largest widget in light mode

Since our first widgets were developed, we’ve added strategic analytics in key areas so that we could understand how merchants were using the functionality. This allowed us to learn from the usage so we could improve the widgets. The data team built dashboards displaying detailed views of how many widgets were installed, the most popular metrics, and the most popular sizes.

Most of the data used to build the dashboards came from analytics events fired through the widgets and the Shopify app.

For these new widgets, we wanted to better understand adoption and retention of widgets, so we needed to capture how users are configuring their widgets over time and which ones are being added or removed.

Detecting Addition and Removal of Widgets 

Unlike iOS, capturing this data on Android is straightforward. To capture when a merchant adds a widget, we send our analytics event when the configuration is saved. When a widget is removed, the widget’s built-in onDeleted() method gives us the ID of the removed widget. We can then look up our widget information in shared preferences and send our event prior to permanently deleting the widget information from the device.
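A rough sketch of that hook, with the analytics call stubbed out as a log line and a hypothetical shared preferences layout:

import android.appwidget.AppWidgetProvider
import android.content.Context
import android.util.Log

// Hedged sketch: the preference keys and the provider name are illustrative, and
// the Log call stands in for our real analytics event.
class InsightsWidgetProvider : AppWidgetProvider() {
    override fun onDeleted(context: Context, appWidgetIds: IntArray) {
        val prefs = context.getSharedPreferences("insights_widgets", Context.MODE_PRIVATE)
        appWidgetIds.forEach { widgetId ->
            val config = prefs.getString("widget_$widgetId", null)
            // Fire the "widget removed" event before the stored configuration is gone.
            Log.d("InsightsWidget", "Removed widget $widgetId with configuration $config")
            prefs.edit().remove("widget_$widgetId").apply()
        }
        super.onDeleted(context, appWidgetIds)
    }
}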

Supporting Android 12

When we started development of our new widgets, our application was targeting Android 11. We knew we’d be targeting Android 12 eventually, but we didn’t want the upgrade to block the release. We decided to implement Android 12 specific features once our application targeted the newer version, leading to an unforeseen issue during the upgrade process with widget selection.

Our approach to widget selection in previous versions was to display each available size as an option. With the introduction of responsive layouts, we no longer needed to display each size as its own option. Merchants can now pick a single widget and resize it to their desired layout. In previous versions, merchants could select a small, medium, or large widget. In version 12 and up, merchants can only select a single widget that can be resized to the same layouts as small, medium, and large, plus several other layouts that fall in between. This pattern follows what Google does with the large weather widget included on devices, as well as an example in their documentation. We disabled the medium and small widgets in Android 12 by adding a flag to our AndroidManifest and setting that value in attrs.xml for each version:
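The snippet below is a hedged reconstruction of that idea rather than our exact resources: the legacy receivers are gated behind a boolean that flips per API level (the receiver and resource names are hypothetical):

<!-- AndroidManifest.xml: gate the legacy widget receiver behind a boolean resource. -->
<receiver
    android:name=".widgets.SmallInsightsWidgetProvider"
    android:enabled="@bool/show_legacy_widgets"
    android:exported="false">
    <intent-filter>
        <action android:name="android.appwidget.action.APPWIDGET_UPDATE" />
    </intent-filter>
    <meta-data
        android:name="android.appwidget.provider"
        android:resource="@xml/small_widget_info" />
</receiver>

<!-- res/values/attrs.xml (pre-Android 12): keep the legacy sizes available. -->
<resources>
    <bool name="show_legacy_widgets">true</bool>
</resources>

<!-- res/values-v31/attrs.xml (Android 12+): hide them from the widget picker. -->
<resources>
    <bool name="show_legacy_widgets">false</bool>
</resources>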

The approach above behaves as expected: the medium and small widgets are no longer available from the picker. However, if a merchant was already on Android 12 and had added a medium or small widget before our widget update, those widgets were removed from their home screen. This could easily be seen as a bug and reduce confidence in the feature. In retrospect, we could have prototyped what the upgrade would look like to a merchant who was already on Android 12.

Allowing only the large widget to be available was a data-driven decision. By tracking widget usage at launch, we saw that the large widget was the most popular and removing the other two would have the least impact on current widget merchants.

Building New Layouts

We encountered an error when building the new layouts that fit between the original small, medium, and large widgets.

After researching the error, we found we were exceeding the Binder transaction buffer. However, the buffer’s size is 1 MB and the error reported 0.66 MB, which wasn’t exceeding the documented buffer size. This error appears to have stumped a lot of developers. After experimenting with ways to get the size down, we found that we could either drop a couple of entire layouts or remove support for a fourth and fifth row of the small metric. We decided on the latter, which is why our 2x3 widget only has three rows of data when it has room for five.

Rethinking the Configuration Screen

Now that we had only one widget to choose from, we had to rethink what our configuration screen would look like to a merchant. Without our three fixed sizes, we could no longer display a fixed number of metrics in our configuration.

Our only choice was to display the maximum number of metrics available for the largest size (which is seven at the time of this writing). Not only did this make the most sense to us from a UX perspective, but we also had to do it this way because of how the new responsive layouts work. Android has to know all of the possible layouts ahead of time. Even if a user shrinks their widget to a size that displays a single metric, Android has to know what the other six are, so it can be resized to our largest layout without any hiccups.

We also updated the description that’s displayed at the top of the configuration screen that explains this behavior.

Capturing More Analytics

On iOS, we capture analytical data when a merchant reconfigures a widget to gain insights into usage patterns. Reconfiguration on Android was only possible in version 12, and due to the limitations of the AppWidgetProvider’s onAppWidgetOptionsChanged() method, we were unable to capture this data on Android.

I’ll share more information about building our layouts in order to give context to our problem. Setting breakpoints for new dynamic layouts works very well across all devices. Google recommends creating a mapping of your breakpoints to the desired remote view layout. To build on a previous example where we showed the buildWidgetRemoteView() method, we used this method again as part of our breakpoint mapping. This approach allows us to reliably map our breakpoints to the desired widget layout.
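A hedged sketch of that mapping on Android 12+, reusing the buildWidgetRemoteView() helper mentioned above and the illustrative WidgetLayout enum from earlier (the breakpoint values are examples only):

import android.appwidget.AppWidgetManager
import android.content.Context
import android.util.SizeF
import android.widget.RemoteViews

// Hedged sketch of mapping size breakpoints to layouts on Android 12+; the SizeF
// breakpoints are examples, and buildWidgetRemoteView() is the helper referenced above.
fun updateResponsiveWidget(context: Context, manager: AppWidgetManager, widgetId: Int) {
    val sizeToLayout: Map<SizeF, RemoteViews> = mapOf(
        SizeF(120f, 60f) to buildWidgetRemoteView(context, WidgetLayout.GRID_2X1),
        SizeF(160f, 160f) to buildWidgetRemoteView(context, WidgetLayout.GRID_2X2),
        SizeF(260f, 160f) to buildWidgetRemoteView(context, WidgetLayout.GRID_4X2),
        SizeF(260f, 280f) to buildWidgetRemoteView(context, WidgetLayout.GRID_4X4),
    )
    // The system picks the best-fitting layout for the widget's current size.
    manager.updateAppWidget(widgetId, RemoteViews(sizeToLayout))
}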

When reconfiguring or resizing a widget, the onAppWidgetOptionsChanged() method is called. This is where we’d want to capture our analytical data about what had changed. Unfortunately, this view mapping doesn’t exist there. We have access to width and height values measured in dp, which initially appeared useful. At first, we felt we could translate these measurements into something meaningful and map the values back to our layout sizes. After testing on a couple of devices, we realized the approach was unreliable and would lead to a large volume of bad analytical data. Without confidently knowing what size we were coming from, or going to, we decided to omit this particular analytics event on Android. We hope to bring this to Google’s attention and get it included in a future release.

Shipping New Widgets

Already having a pair of existing widgets, we had to decide how to transition to the new widgets as they would be replacing the existing implementation.

We didn’t find much documentation around migrating widgets. The docs only provided a way to enhance your widget, which means adding the new features of Android 12 to something you already had. This wasn’t applicable to us since our existing widgets were so different from the ones we were building.

The major issue that we couldn’t get around was related to the sizing strategies of our existing and new widgets. The existing widgets used a fixed width and height so they’d always be square. Our new widgets take whatever space is available. There wasn’t a way to guarantee that the new widget would fit in the available space that the existing widget had occupied. If the existing widget was the same size as the new one, it would have been worth exploring further.

The initial plan we had hoped for was to make one of our widgets transform into the new widget while removing the other one. Given the above, this strategy wouldn’t work.

The compromise we came to, so as not to completely remove all of a merchant’s widgets overnight, was to deprecate the old widgets at the same time we released the new ones. To deprecate, we updated our old widgets’ UI to display a message informing the merchant that the widget is no longer supported and that they must add the new ones.

Screen displaying the notice: This widget is no longer active. Add the new Shopify Insights widget for an improved view of your data. Learn more.
Widget deprecation message

There’s no way to add a new widget programmatically or to bring the merchant to the widget picker by tapping on the old widget. We added some communication to help ease the transition by updating our help center docs, including information around how to use widgets, pointing our old widgets to open the help center docs, and just giving lots of time before removing the deprecation message. In the end, it wasn’t the most ideal situation and we came away learning about the pitfalls within the two ecosystems.

What’s Next

As we continue to learn about how merchants use our new generation of widgets and Android 12 features, we’ll continue to hone in on the best experience across both our platforms. This also opens the way for other teams at Shopify to build on what we’ve started and create more widgets to help Merchants. 

On the topic of mobile-only platforms, this leads us into getting up to speed on Wear OS. Our watchOS app is about to get a refresh with the addition of WidgetKit; it feels like a great time to finally give our Android merchants watch support too!

Matt Bowen is a mobile developer on the Core Admin Experience team. Located in West Warwick, RI. Personal hobbies include exercising and watching the Boston Celtics and New England Patriots.

James Lockhart is a Staff Developer based in Ottawa, Canada. He has experienced mobile development over the past 10+ years: from PhoneGap to native iOS/Android and now React Native. He is an avid baker and cook when not contributing to the Shopify admin app.

Cecilia Hunka is a developer on the Core Admin Experience team at Shopify. Based in Toronto, Canada, she loves live music and jigsaw puzzles.

Carlos Pereira is a Senior Developer based in Montreal, Canada, with more than a decade of experience building native iOS applications in Objective-C, Swift and now React Native. Currently contributing to the Shopify admin app.


Wherever you are, your next journey starts here! If building systems from the ground up to solve real-world problems interests you, our Engineering blog has stories about other challenges we have encountered. Intrigued? Visit our Engineering career page to find out about our open positions and learn about Digital by Design.

Continue reading

Lessons From Building iOS Widgets

Lessons From Building iOS Widgets

By Carlos Pereira, James Lockhart, and Cecilia Hunka

When the iOS 14 beta was originally announced, we knew we needed to take advantage of the new widget features and get something to our merchants. The new widgets looked awesome and could really give merchants a way to see their shop’s data at a glance without needing to open our app.

Fast forward a couple of years, and we now have lots of feedback on the new design. We knew merchants were using the widgets, but they needed more. The current design was lacking: it only provided two metrics, and the widgets took up a lot of space. This experience prompted us to start a new project to upgrade our original design to better fit our merchants’ needs.

Why Widgets Are Important to Shopify

Our widgets mainly focus on analytics. Analytics help merchants understand how they’re doing and gain insights to quickly make better decisions about their business. Monitoring metrics is a daily activity for a lot of our merchants, and on mobile we have the opportunity to give them a faster way to access this data through widgets. Widgets provide “at a glance” information about a shop and give merchants a unique avenue for quickly getting a pulse on their business that they wouldn’t find on desktop.

A screenshot showing the add widget screen for Insights on iOS
Add Insights widget

After gathering feedback and continuously looking for opportunities to enhance our widget capabilities, we’re at our third iteration, and we’ll share with you how we approached building widgets and some of the challenges we faced.

Why We Didn’t Use React Native

A couple of years ago, Shopify decided to go all in on React Native. New development was done in React Native, and we began migrating some apps to the new stack, including our flagship admin app, where we were building our widgets. This posed the question: should we write the widgets in React Native?

After doing some investigation, we quickly hit some roadblocks: app extensions are limited in terms of memory, WidgetKit’s architecture is highly optimized to work with SwiftUI (the view hierarchy is serialized to disk), and, at this time, there’s no official support in the React Native community for widgets.

Shopify believes in using the right tool for the job, and we believe that native development with SwiftUI was the best choice in this case.

Building the Widgets

When building out our architecture for widgets, we wanted to create a consistent experience on both iOS and Android while preserving platform idioms where it made sense. Below we’ll go over our experience and strategies building the widgets, pointing out some of the more difficult challenges we faced. Our aim is to shed some light around these less talked about surfaces, give some inspiration for your projects, and hopefully, save time when it comes to implementing your widgets.

Fetching Data

Some types of widgets have data that changes less frequently (for example, reminders) and some whose data can be forecasted for the entire day (for example, calendar and weather). In our case, merchants need up-to-date metrics about their business, so we need to show data that’s as fresh as possible. Timeliness is crucial for our widget. Delays in data can cause confusion or, even worse, delay information that could inform a business decision. For example, let’s say you watch the Stocks app. You would expect the app and its corresponding widget data to be as up to date as possible. If the data is multiple hours stale, you could miss valuable information for making decisions, or you could miss an important drop or rise in price. With our product, our merchants need information that’s as up to date as we can provide to run their business.

Fetching Data in the App

Widgets can be kept up to date with relevant and timely information by using data available locally or fetching it from a server. The server fetching can be initiated by the widget itself or by the host app. In our case, since the app doesn’t share the same information as the widget, we decided it made more sense to fetch it from the widget.

We may still move data fetching to the app once we start sharing similar data between the widgets and the app. This architecture could simplify the handling of authentication, state management, updating data, and caching in our widget, since only one process would have this job rather than two separate processes. It’s worth noting that the widget can access code from the main app, but the two can only communicate data through the keychain and shared user defaults, as widgets run in a separate process. Sharing the data fetching, however, comes with the added complexity of having a background process push or make data available to the widgets, since widgets must remain functional even if the app isn’t in the foreground or background. For now, we’re happy with the current solution: the widgets fetch data independently from the app while sharing the session management code and tokens.

A flow diagram highlighting widgets fetch data independently from the app while sharing the session management code and tokens
Current solution where widgets fetch data independently

Querying Business Analytics Data with Reportify and ShopifyQL

The business data and visualizations displayed in the widgets are powered by Reportify, an in-house service that exposes data through a set of schemas queried via ShopifyQL, Shopify’s commerce data querying language. It looks very similar to SQL but is designed around data for commerce. For example, to fetch a shop’s total sales for the day:
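The query below is an illustrative sketch only; the exact schema and field names exposed by Reportify may differ from what’s shown here:

FROM sales
  SHOW total_sales
  DURING today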

Making Our Widgets Antifragile

iOS Widgets' architecture is built in a way that updates are mindful of battery usage and are budgeted by the system. In the same way, our widgets must also be mindful of saving bandwidth when fetching data over a network. While developing our second iteration we came across a peculiar problem that was exacerbated by our specific use case.

Since we need data to be fresh, we always pull new data from our backend on every update. Updates are approximately 15 minutes apart to avoid having our widgets stop updating (you can read about why on Apple’s developer site). We found that iOS calls the update methods, getTimeline() and getSnapshot(), more than once in an update cycle. In widgets like Calendar, these extra calls come without much extra cost, as the data is stored locally. In our app, however, this was triggering two to five extra network calls for the same widget with the same data in quick succession.

We also noticed these calls were causing a seemingly unrelated kick out issue affecting the app. Each widget runs in a different process than the main application, and all widgets share the keychain. Once the app requests data from the API, it checks to see if it has an authenticated token in the keychain. If that token is stale, our system pauses updates, refreshes the token, and continues network requests. In the case of our widgets, each call to update was creating another workflow that could need a token refresh. When we only had a single widget or update flow, it worked great! Even four to five updates would usually work pretty well. Eventually, however, one of these network calls would come out of order and an invalid token would get saved. On our next update, we have no way to retrieve data or request a new token, resulting in a session kick out. This was a great find, as it was causing a lot of frustration for our affected merchants and for ourselves, who could never really put our finger on why these things would, every now and again, just log us out.

In order to correct the unnecessary roundtrips, we built a simple short-lived cache:

  1. The system asks our widget to provide new data
  2. We first look into the local cache using a key specific to that widget. On iOS, our key is produced from the widget’s configuration, as there are no unique identifiers provided. We also take configuration such as locale into account in the key, so a language change still forces an update.
  3. If there’s data, and that data was set less than one minute ago, we return it and avoid making a network request. 
  4. Otherwise, we fetch the data as normal and store it in the cache with the timestamp.
      A flow diagram highlighting the steps of the cache
      The simple short-lived cache flow

      With this solution, we reduced unused network calls and system load, avoided collecting incorrect analytics, and fixed a long running bug with our app kick outs!
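A minimal Swift sketch of that cache, assuming storage in shared user defaults; the type names and key derivation are illustrative:

import Foundation

// Hedged sketch of the short-lived cache described above; the one-minute window
// mirrors the steps in the list, but names and storage details are illustrative.
struct CachedMetrics: Codable {
    let payload: Data
    let timestamp: Date
}

final class ShortLivedWidgetCache {
    private let maxAge: TimeInterval = 60
    private let defaults: UserDefaults

    init(defaults: UserDefaults) {
        self.defaults = defaults
    }

    // The key is derived from the widget configuration (metrics, shop, date range,
    // locale) since WidgetKit gives us no stable per-widget identifier.
    func value(forKey key: String) -> Data? {
        guard
            let data = defaults.data(forKey: key),
            let cached = try? JSONDecoder().decode(CachedMetrics.self, from: data),
            Date().timeIntervalSince(cached.timestamp) < maxAge
        else { return nil }
        return cached.payload
    }

    func store(_ payload: Data, forKey key: String) {
        let cached = CachedMetrics(payload: payload, timestamp: Date())
        defaults.set(try? JSONEncoder().encode(cached), forKey: key)
    }
}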

      Implementing Decoder Strategy with Dynamic Selections

When fetching the data from the Analytics REST service, each widget can be configured with two to seven metrics from a total of 12. This set should grow in the future as new metrics become available, too!

But that doesn’t mean the structure of future metrics won’t change. For example, what about a metric that contains data that isn’t mapped over a time range (like orders to fulfill, which doesn’t contain any historical information)?

      The merchant is also able to configure the order the metrics appear, which shop (if they have more than one shop), and which date range represents the data: today, last 7 days, and last 30 days.

      We had to implement a data fetching and decoding mechanism that:

      • only fetches the data the merchant requested in order to avoid asking for unneeded information
      • supports a set of metrics as well as being flexible to add future metrics with different shapes
      • supports different date ranges for the data.

      A simplified version of the solution is shown below. First, we create a struct to represent the query to the analytics service (Reportify).
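Since the original snippet isn’t reproduced here, this is a hedged approximation; the property names are assumptions:

// Hedged approximation of the query struct; property names are assumptions.
struct MetricQuery: Encodable {
    let metricID: String   // e.g. "total_sales"
    let dateRange: String  // "today", "last_7_days", or "last_30_days"
    let shopID: String
}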

      Then, we create a class to represent the decodable response. Right now it has a fixed structure (value, comparison, and chart values), but in the future we can use an enum or different subclasses to decode different shapes.
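A minimal sketch of that fixed shape, again with assumed property names:

// Hedged sketch of the fixed response shape (value, comparison, and chart values).
final class MetricResponse: Decodable {
    let value: Double
    let comparison: Double
    let chartValues: [Double]
}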

      Next, we create a response wrapper that attempts to decode the metrics based on a list of metric types passed to it. Each metric has its configuration, so we know which class is used to read the values.
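One hedged way to sketch this is to pass the requested metrics through the decoder’s userInfo; the key and the reuse of the MetricQuery and MetricResponse types from the sketches above are assumptions:

import Foundation

// Hedged sketch: the requested metrics ride along in the decoder's userInfo so
// each chunk can be decoded with the type its metric expects.
extension CodingUserInfoKey {
    static let requestedMetrics = CodingUserInfoKey(rawValue: "requestedMetrics")!
}

struct MetricsResponseWrapper: Decodable {
    let metrics: [MetricResponse]

    init(from decoder: Decoder) throws {
        let requested = decoder.userInfo[.requestedMetrics] as? [MetricQuery] ?? []
        var container = try decoder.unkeyedContainer()
        var decoded: [MetricResponse] = []
        for _ in requested where !container.isAtEnd {
            // Every metric currently shares MetricResponse's shape; a future metric
            // could switch on its query here and decode a different class instead.
            decoded.append(try container.decode(MetricResponse.self))
        }
        metrics = decoded
    }
}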

      Finally, when the widget Timeline Provider asks for new data, we fetch the data from the current metrics and decode the response. 

      Building the UI

      We wanted to support the three widget sizes: small, medium, and large. From the start we wanted to have a single View to support all sizes as an attempt to minimize UI discrepancies and make the code easy to maintain.

      We started by identifying the common structure and creating components. We ended up with a Metric Cell component that has three variations:

      A metric cell from a widget
      A metric cell
      A metric cell from a widget
      A metric cell with a sparkline
      A metric cell from a widget
      A metric cell with barchart

      All three variations consist of a metric name and value, chart, and a comparison. As the widget containers become bigger, we show the merchant more data. Each view size contains more metrics, and the largest widget contains a full width chart on the first chosen metric. The comparison indicator also gets shifted from bottom to right on this variation.

      The first chosen metric, on the large widget, is shown as a full width cell with a bar chart showing the data more clearly; we call it the Primary cell. We added a structure to indicate if a cell is going to be used as primary or not. Besides the primary flag, our component doesn’t have any context about the widget size, so we use chart data as an indicator to render a cell primary or not. This paradigm fits very well with SwiftUI.

      A simplified version of the actual Cell View:
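A hedged approximation of such a cell; the placeholder bar rendering and styling are illustrative, and the real view has more variations (sparkline, bar chart, empty):

import SwiftUI

// Hedged approximation of the Metric Cell; details are illustrative.
struct MetricCell: View {
    let name: String
    let value: String
    let comparison: String
    let chartValues: [Double]
    let isPrimary: Bool // primary cells render a full-width chart

    var body: some View {
        VStack(alignment: .leading, spacing: 4) {
            Text(name).font(.caption).foregroundColor(.secondary)
            Text(value).font(.headline)
            if isPrimary {
                HStack(alignment: .bottom, spacing: 2) {
                    ForEach(chartValues.indices, id: \.self) { index in
                        Capsule().frame(width: 4, height: CGFloat(max(2, chartValues[index])))
                    }
                }
            }
            Text(comparison).font(.caption2)
        }
    }
}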

      After building our cells, we need to create a structure to render them in a grid according to the size and metrics chosen by the merchant. This component also has no context of the widget size, so our layout decisions are mainly based on how many metrics we are receiving. In this example, we’ll refer to the View as a WidgetView.

      The WidgetView is initialized with a WidgetState, a struct that holds most of the widget data such as shop information, the chosen metrics and their data, and a last updated string (which represents the last time the widget was updated).

      To be able to make decisions on layout based on the widget characteristics, we created an OptionSet called LayoutOption. This is passed as an array to the WidgetView.

      Layout options:
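A sketch of what that OptionSet can look like; the option names are assumptions:

// Hedged sketch of the LayoutOption OptionSet; the option names are assumptions.
struct LayoutOption: OptionSet {
    let rawValue: Int

    static let primaryCell = LayoutOption(rawValue: 1 << 0)        // first metric gets the large cell
    static let trailingComparison = LayoutOption(rawValue: 1 << 1) // comparison sits to the right
    static let compactSpacing = LayoutOption(rawValue: 1 << 2)
}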

That helped us avoid tying this component to widget families; it’s tied to layout characteristics instead, which makes the component very reusable in other contexts.

      The WidgetView layout is built using mainly a LazyVGrid component:

      A simplified version of the actual View:
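A hedged approximation, continuing with the illustrative MetricCell and LayoutOption types from the sketches above; WidgetState is reduced to what this example needs and the column math is simplified:

import SwiftUI

// Hedged approximation of the WidgetView built around LazyVGrid.
struct MetricCellModel {
    let name: String
    let value: String
    let comparison: String
    let chartValues: [Double]
}

struct WidgetState {
    let shopName: String
    let lastUpdated: String
    let metrics: [MetricCellModel]
}

struct WidgetView: View {
    let state: WidgetState
    let layoutOptions: [LayoutOption]

    private var columns: [GridItem] {
        // Two columns once there is more than a couple of metrics, otherwise one.
        Array(repeating: GridItem(.flexible(), spacing: 8),
              count: state.metrics.count > 2 ? 2 : 1)
    }

    var body: some View {
        VStack(alignment: .leading, spacing: 8) {
            Text(state.shopName).font(.caption).bold()
            LazyVGrid(columns: columns, spacing: 8) {
                ForEach(state.metrics.indices, id: \.self) { index in
                    MetricCell(
                        name: state.metrics[index].name,
                        value: state.metrics[index].value,
                        comparison: state.metrics[index].comparison,
                        chartValues: state.metrics[index].chartValues,
                        isPrimary: index == 0 && layoutOptions.contains(.primaryCell)
                    )
                }
            }
            Text(state.lastUpdated).font(.caption2).foregroundColor(.secondary)
        }
        .padding()
    }
}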

      Adding Dynamic Dates

One important piece of information on our widget is the last updated timestamp. It helps remove confusion by allowing merchants to quickly know how fresh the data they’re looking at is. Since iOS has an approximate update time with many variables, coupled with data connectivity, it’s very possible the data could be over 15 minutes old. If the data is quite stale (say you went to the cottage for the weekend and missed a few updates) and there was no update string, you would assume the data you’re looking at is up to the second. This can cause unnecessary confusion for our merchants. The solution here was to ensure there’s some communication to our merchant about when the last update was.

In our previous design, we only had small widgets, and they were able to display only one metric. This information resulted in a long string that, on smaller devices, would sometimes wrap and show over two lines. This was fine when space was abundant in our older design but not in our new data-rich designs. We explored how we could best work with timestamps on widgets, and the most promising solution was to use relative time. Instead of having a static value such as “as of 3:30pm” like our previous iteration, we would have a dynamic date that would look like “1 min, 3 sec ago.”

      One thing to remember is that even though the widget is visible, we have a limited number of updates we can trigger. Otherwise, it would be consuming a lot of unnecessary resources on the merchant’s device. We knew we couldn’t keep triggering updates on the widget as often as we wanted (nor would it be allowed), but iOS has ways to deal with this. Apple did release support for dynamic text on widgets during our development that allowed using timers on your widgets without requiring updates. We simply need to pass a style to a Text component and it automatically keeps everything up to date:

      Text("\(now, style: .relative) ago")

It was good, but there are no options to customize the relative style. Being able to customize it was an important point for us, as the supported style doesn’t fit well with our widget layout. One of our biggest constraints with widgets is space, as we always need to think about the smallest widget possible. In the end, we decided not to move forward with the relative time approach and kept a reduced version of our previous timestamp.

      Adding Configuration

Our new widgets have a great amount of configuration, allowing merchants to choose exactly what they care about. For each widget size, the merchant can select the store, a certain number of metrics, and a date range. On iOS, widgets are configured through the SiriKit Intents API. We faced some challenges with the WidgetConfiguration, but fortunately, all had workarounds that fit our use cases.

      Insights widget configuration
      Insights widget configuration

      It’s Not Possible to Deselect a Metric

      When defining a field that has multiple values provided dynamically, we can limit the number of options per widget family. This was important for us, since each widget size has a different number of metrics it can support. However, the current UI on iOS for widget configuration only allows selecting a value but not deselecting it. So, once we selected a metric we couldn’t remove it, only update the selection. But what if the merchant were only interested in one metric on the small widget? We solved this with a small design change, by providing “None” as an option. If the merchant were to choose this option, it would be ignored and shown as an empty state. 

It’s Not Possible to Validate the User Selections

With the addition of “None” and the way intents are designed, it was possible to select “None” everywhere and end up with a widget with no metrics. It was also possible to select the same metric twice. We would have liked to validate the user selection, but the Intents API didn’t support it. The solution was to embrace the fact that a widget can be empty and show it as an empty state. Duplicates were filtered out, so any repeated metric selection was changed to “None” before we sent any network requests.

      The First Calls to getTimeline and getSnapshot Don’t Respect the Maximum Metric Count

      For intent configurations provided dynamically, we must provide default values in the IntentHandler. In our case, the metrics list varies per widget family. In the IntentHandler, it’s not possible to query which widget family is being used. So we had to return at least as many metrics as the largest widget (seven). 

      However, even if we limit the number of metrics per family, the first getTimeline and getSnapshot calls in the Timeline Provider were filling the configuration object with all default metrics, so a small widget would have seven metrics instead of two!

      We ended up adding some cleanup code in the beginning of the Timeline Provider methods that trims the list depending on the expected number of metrics.
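A small hedged sketch of that cleanup; the per-family limits and the shape of the configured metric list are assumptions:

import WidgetKit

// Hedged sketch of the cleanup at the top of getTimeline/getSnapshot; the
// per-family limits are illustrative.
func trimmedMetrics<Metric>(_ configuredMetrics: [Metric], for family: WidgetFamily) -> [Metric] {
    let maxMetricCount: Int
    switch family {
    case .systemSmall: maxMetricCount = 2
    case .systemMedium: maxMetricCount = 4
    default: maxMetricCount = 7
    }
    // The first calls can hand us all of the defaults, so cap the list per family.
    return Array(configuredMetrics.prefix(maxMetricCount))
}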

      Optimizing Testing

      Automated tests are a fundamental part of Shopify’s development process. In the Shopify app, we have a good amount of unit and snapshot tests. The old widgets on Android had good test coverage already, and we built on the existing infrastructure. On iOS, however, there were no tests since it’s currently not possible to add test targets against a widget extension on Xcode.

      Given this would be a complex project and we didn’t want to compromise on quality, we investigated possible solutions for it.

The simplest solution would be to add each file to both the app and the widget extension targets; then we could unit test it on the app side in our standard test target. We decided not to do this, since we would always need to add every file to both targets and it would bloat the Shopify app unnecessarily.

      We chose to create a separate module (a framework in our case) and move all testable code there. Then we could create unit and snapshot tests for this module.

      We ended up moving most of the code, like views and business logic, to this new module (WidgetCore), while the extension only had WidgetKit specific code and configuration like Timeline provider, widget bundle, and intent definition generated files.

      Given our code in the Shopify app is based on UIKit, we did have to update our in-house snapshot testing framework to support SwiftUI views. We were very happy with the results. We ended up achieving a high test coverage, and the tests flagged many regressions during development.

      Fast SwiftUI Previews 

      The Shopify app is a big application, and it takes a while to build. Given the widget extension is based on our main app target, it took a long time to prepare the SwiftUI previews. This caused frustration during development. It also removed one of the biggest benefits of SwiftUI—our ability to iterate quickly with Previews and the fast feedback cycle during UI development.

      One idea we had was to create a module that didn’t rely on our main app target. We created one called WidgetCore where we put a lot of our reusable Views and business logic. It was fast to build and could also render SwiftUI previews. The one caveat is, since it wasn’t a widget extension target, we couldn’t leverage the WidgetPreviewContext API to render views on a device. It meant we needed to load up the extension to ensure the designs and changes were always working as expected on all sizes and modes (light and dark).

      To solve this problem, we created a PreviewLayout extension. This had all the widget sizes based on the Apple documentation, and we were able to use it in a similar way:

      Our PreviewLayout extension would be used on all of our widget related views in our WidgetCore module to emulate the sizes in previews:
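A hedged sketch of such an extension and its usage; the point sizes are approximate values based on Apple’s published widget size tables:

import SwiftUI

// Hedged sketch of the PreviewLayout helper and its usage inside WidgetCore.
extension PreviewLayout {
    static let widgetSmall = PreviewLayout.fixed(width: 155, height: 155)
    static let widgetMedium = PreviewLayout.fixed(width: 329, height: 155)
    static let widgetLarge = PreviewLayout.fixed(width: 329, height: 345)
}

struct WidgetCorePreviews: PreviewProvider {
    static var previews: some View {
        // Any view from WidgetCore can be dropped in here, without the extension target.
        Text("Widget view under test")
            .previewLayout(.widgetMedium)
            .preferredColorScheme(.dark)
    }
}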

      Acquiring Analytics

      Shopify Insights widget largest size in light mode
      Largest widget in light mode

Since our first widgets were developed, we’ve wanted to understand how merchants use the functionality so we can keep improving it. The data team built some dashboards showing things like a detailed view of how many widgets are installed, the most popular metrics, and the most popular sizes.

Most of the data used to build the dashboards comes from analytics events fired through the widgets and the Shopify app.

      For the new widgets, we wanted to better understand adoption and retention of widgets, so we needed to capture how users are configuring their widgets over time and which ones are being added or removed.

      Managing Unique Ids

      WidgetKit has the WidgetCenter struct that allows requesting information about the widgets currently configured in the device through the getCurrentConfigurations method. However, the list of metadata returned (WidgetInfo) doesn’t have a stable unique identifier. Its identifier is the object itself, since it’s hashable. Given this constraint, if two identical widgets are added, they’ll both have the same identifier. Also, given the intent configuration is part of the id, if something changes (for example, date range) it’ll look like it’s a totally different widget.

      Given this limitation, we had to adjust the way we calculate the number of unique widgets. It also made it harder to distinguish between different life-cycle events (adding, removing, and configuring). Hopefully there will be a way to get unique ids for widgets in future versions of iOS. For now we created a single value derived from the most important parts of the widget configuration.

      Detecting, Adding, and Removing Widgets 

      Currently there’s no WidgetKit life cycle method that tells us when a widget was added, configured, or removed. We needed it so we can better understand how widgets are being used.

      After some exploration, we noticed that the only methods we could count on were getTimeline and getSnapshot. We then decided to build something that could simulate these missing life cycle methods by using the ones we had available. getSnapshot is usually called on state transitions and also on the widget Gallery, so we discarded it as an option.

We built a solution that does the following:

1. Every time the Timeline provider’s getTimeline is called, we call WidgetKit’s getCurrentConfigurations to see which widgets are currently installed.
2. We then compare this list with a previous snapshot we persist on disk.
3. Based on this comparison, we guess which widgets were added and removed.
4. Then we trigger the proper life cycle methods: didAddWidgets() and didRemoveWidgets(), as sketched below.
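A hedged Swift sketch of that comparison; the identifier derivation, persistence, and callback names follow the steps above, but the details are illustrative:

import WidgetKit

// Hedged sketch: derive identifiers from the configuration, diff against the last
// persisted snapshot, and fire the synthetic life cycle callbacks.
func detectWidgetLifecycleEvents(previousIDs: Set<String>,
                                 persist: @escaping (Set<String>) -> Void,
                                 didAddWidgets: @escaping (Set<String>) -> Void,
                                 didRemoveWidgets: @escaping (Set<String>) -> Void) {
    WidgetCenter.shared.getCurrentConfigurations { result in
        guard case .success(let widgets) = result else { return }

        // There is no stable unique id, so build one from the parts of the
        // configuration we care about (kind and family here; the real key uses more).
        let currentIDs = Set(widgets.map { "\($0.kind)-\($0.family)" })

        let added = currentIDs.subtracting(previousIDs)
        let removed = previousIDs.subtracting(currentIDs)

        if !added.isEmpty { didAddWidgets(added) }
        if !removed.isEmpty { didRemoveWidgets(removed) }

        persist(currentIDs) // snapshot for the next getTimeline call
    }
}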

      Due to identifiers not being stable, we couldn’t find a reliable approach to detect configuration changes, so we ended up not supporting it.

We also noticed that the results of WidgetKit’s getCurrentConfigurations can have some delay. If we remove a widget, it may take a couple of getTimeline calls for the change to be reflected. We adjusted our analytics scheme to take that into account.

      Detecting, adding, and removing widgets solution
      Current solution

      Supporting iOS 16

      Our approach to widgets made supporting iOS 16 out of the gate really simple with a few changes. Since our lock screen complications will surface the same information as our home screen widgets, we can actually reuse the Intent configuration, Timeline Provider, and most of the views! The only change we need to make is to adjust the supported families to include .accessoryInline, .accessoryCircular, and .accessoryRectangular, and, of course, draw those views.

      Our Main View would also just need a slight adjustment to work with our existing home screen widgets.

      Migrating Gracefully

Support for building watchOS complications with WidgetKit was introduced alongside iOS 16 and watchOS 9. This update comes with a foreboding message from Apple:

      Important
      As soon as you offer a widget-based complication, the system stops calling ClockKit APIs. For example, it no longer calls your CLKComplicationDataSource object’s methods to request timeline entries. The system may still wake your data source for migration requests.

We really care about our apps at Shopify, so we needed to unpack what this meant and how it affects our merchants running older devices. With some testing on devices, we were able to find out that everything is fine.

      If you’re currently running WidgetKit complications and add support for lock screen complications, your ClockKit app and complications will continue to function as you’d expect.

We had assumed that WidgetKit itself was taking the place of watchOS complications; however, to use WidgetKit on watchOS, you need to create a new target for the watch. This makes sense, although the APIs are so similar that we had assumed it was a one-and-done approach: one WidgetKit extension for both platforms.

One thing to watch out for: if you do implement the new WidgetKit on watchOS, users on watchOS 9 and above will lose all of their complications from ClockKit. Apple did provide a migration API to support the change, which is called instead of your old complications.

If you don’t have the luxury of simply setting your target to iOS 16, your ClockKit complications will, from our testing, continue to load for those on watchOS 8 and below.

      Shipping New Widgets

      We already had a set of widgets running on both platforms, now we had to decide how to transition to the new update as they would be replacing the existing implementation. On iOS we had two different widget kinds each with their own small widget (you can think of kinds as a widget group). With the new implementation, we wanted to provide a single widget kind that offered all three sizes. We didn’t find much documentation around the migration, so we simulated what happens to the widgets under different scenarios.

      If the merchant has a widget on their home screen and the app updates, one of two things would happen:

      1. The widget would become a white blank square (the kind IDs matched).
      2. The widget just disappeared altogether (the kind ID was changed).

The initial plan (the one we had hoped for) was to make one of our widgets transform into the new widget while removing the other one. Given the above, this strategy wouldn’t work. It would also have left some annoying tech debt, since all of our Intent files would continue to mention the name of the old widget.

      The compromise we came to, so as to not completely remove all of a merchant’s widgets overnight, was to deprecate the old widgets at the same time we release the new ones. To deprecate, we updated our old widget’s UI to display a message informing that the widget is no longer supported, and the merchant must add the new ones. The lesson here is you have to be careful when you make decisions around widget grouping as it’s not easy to change.

      There’s no way to add a new widget programmatically or to bring the merchant to the widget gallery by tapping on the old widget. We also added some communication to help ease the transition by:

        • updating our help center docs, including information around how to use widgets 
        • pointing our old widgets to open the help center docs 
        • giving lots of time before removing the deprecation message.

        In the end, it wasn’t the most ideal situation, and we came away learning about the pitfalls within the two ecosystems. One piece of advice is to really reflect on current and future needs when defining which widgets to offer and how to split them, since a future modification may not be straightforward.

        Screen displaying the notice: This widget is no longer active. Add the new Shopify Insights widget for an improved view of your data. Learn more.
        Widget deprecation message

        What’s Next

        As we continue to learn about how merchants use our new generation of widgets, we’ll continue to hone in on the best experience across both our platforms. Our widgets were made to be flexible, and we’ll be able to continually grow the list of metrics we offer through our customization. This work opens the way for other teams at Shopify to build on what we’ve started and create more widgets to help Merchants too.

2022 is a busy year with iOS 16 coming out. We’ve got a new WidgetKit experience to integrate into our watch complications, lock screen complications, and, hopefully later this year, Live Activities!

        Carlos Pereira is a Senior Developer based in Montreal, Canada, with more than a decade of experience building native iOS applications in Objective-C, Swift and now React Native. Currently contributing to the Shopify admin app.

James Lockhart is a Staff Developer based in Ottawa, Canada. He has experienced mobile development over the past 10+ years: from PhoneGap to native iOS/Android and now React Native. He is an avid baker and cook when not contributing to the Shopify admin app.

        Cecilia Hunka is a developer on the Core Admin Experience team at Shopify. Based in Toronto, Canada, she loves live music and jigsaw puzzles.


        Wherever you are, your next journey starts here! If building systems from the ground up to solve real-world problems interests you, our Engineering blog has stories about other challenges we have encountered. Intrigued? Visit our Engineering career page to find out about our open positions and learn about Digital by Design.

        Continue reading

        What is a Full Stack Data Scientist?

        What is a Full Stack Data Scientist?

        At Shopify, we've embraced the idea of full stack data science and are often asked, "What does it mean to be a full stack data scientist?". The term has seen a recent surge in the data industry, but there doesn’t seem to be a consensus on a definition. So, we chatted with a few Shopify data scientists to share our definition and experience.

        What is a Full Stack Data Scientist?

        "Full stack data scientists engage in all stages of the data science lifecycle. While you obviously can’t be a master of everything, full stack data scientists deliver high-impact, relatively quickly because they’re connected to each step in the process and design of what they’re building." - Siphu Langeni, Data Scientist

        "Full stack data science can be summed up by one word—ownership. As a data scientist you own a project end-to-end. You don't need to be an expert in every method, but you need to be familiar with what’s out there. This helps you identify what’s the best solution for what you’re solving for." - Yizhar (Izzy) Toren, Senior Data Scientist

        Typically, data science teams are organized to have different data scientists work on singular aspects of a data science project. However, a full stack data scientist’s scope covers a data science project from end-to-end, including:

        • Discovery and analysis: How you collect, study, and interpret data from a number of different sources. This stage includes identifying business problems.
        • Acquisition: Moving data from diverse sources into your data warehouse.
        • Data modeling: The process for transforming data using batch, streaming, and machine learning tools.

        What Skills Make a Successful Full Stack Data Scientist? 

        "Typically the problems you're solving for, you’re understanding them as you're solving them. That’s why you need to be constantly communicating with your stakeholders and asking questions. You also need good engineering practices. Not only are you responsible for identifying a solution, you also need to build the pipeline to ship that solution into production." - Yizhar (Izzy) Toren, Senior Data Scientist

        "The most effective full stack data scientists don't just wait for ad hoc requests. Instead, they proactively propose solutions to business problems using data. To effectively do this, you need to get comfortable with detailed product analytics and developing an understanding of how your solution will be delivered to your users." - Sebastian Perez Saaibi, Senior Data Science Manager

        Full stack data scientists are generalists versus specialists. As full stack data scientists own projects from end-to-end, they work with multiple stakeholders and teams, developing a wide range of both technical and business skills, including:

        • Business acumen: Full stack data scientists need to be able to identify business problems, and then ask the right questions in order to build the right solution.
        • Communication: Good communication—or data storytelling—is a crucial skill for a full stack data scientist who typically helps influence decisions. You need to be able to effectively communicate your findings in a way that your stakeholders will understand and implement. 
        • Programming: Efficient programming skills in a language like Python and SQL are essential for shipping your code to production.
        • Data analysis and exploration: Exploratory data analysis skills are a critical tool for every full stack data scientist, and the results help answer important business questions.
        • Machine learning: Machine learning is one of many tools a full stack data scientist can use to answer a business question or solve a problem, though it shouldn’t be the default. At Shopify, we’re proponents of starting simple, then iterating with complexity.  

        What’s the Benefit of Being a Full Stack Data Scientist? 

“You get to choose how you want to solve different problems. We don’t have one way of doing things because it really depends on what problem you’re solving for. This can even include deciding which tooling to use.” - Yizhar (Izzy) Toren, Senior Data Scientist

        “You get maximum exposure to various parts of the tech stack, develop a confidence in collaborating with other crafts, and become astute in driving decision-making through actionable insights.” - Siphu Langeni, Data Scientist

        As a generalist, is a full stack data scientist a “master of none”? While full stack data scientists are expected to have a breadth of experience across the data science specialty, each will also bring additional expertise in a specific area. At Shopify, we encourage T-shaped development. Emphasizing this type of development not only enables our data scientists to hone skills they excel at, but it also empowers us to work broadly as a team, leveraging the depth of individuals to solve complex challenges that require multiple skill sets. 

        What Tips Do You Have for Someone Looking to Become a Full Stack Data Scientist? 

        “Full stack data science might be intimidating, especially for folks coming from academic backgrounds. If you've spent a career researching and focusing on building probabilistic programming models, you might be hesitant to go to different parts of the stack. My advice to folks taking the leap is to treat it as a new problem domain. You've already mastered one (or multiple) specialized skills, so look at embracing the breadth of full stack data science as a challenge in itself.” - Sebastian Perez Saaibi, Senior Data Science Manager

        “Ask lots of questions and invest effort into gathering context that could save you time on the backend. And commit to honing your technical skills; you gain trust in others when you know your stuff!” - Siphu Langeni, Data Scientist

        To sum it up, a full stack data scientist is a data scientist who:

        • Focuses on solving business problems
        • Is an owner that’s invested in an end-to-end solution, from identifying the business problem to shipping the solution to production
        • Develops a breadth of skills that cover the full stack of data science, while building out T-shaped skills
        • Knows which tool and technique to use, and when

        If you’re interested in tackling challenges as a full stack data scientist, check out Shopify’s career page.

        Continue reading

        Managing React Form State Using the React-Form Library

        Managing React Form State Using the React-Form Library

        One of Shopify’s philosophies when it comes to adopting a new technology isn’t only to level up the proficiency of our developers so they can implement the technology at scale, but also with the intent of sharing their new found knowledge and understanding of the tech with the developer community.

        In Part 1 (Building a Form with Polaris) of this series, we were introduced to Shopify’s Polaris Design System, an open source library used to develop the UI within our Admin and here in Part 2 we’ll delve further into Shopify’s open source Quilt repo that contains 72 npm packages, one of which is the react-form library. Each package was created to facilitate the adoption and standardization of React and each has its own README and thorough documentation to help get you started.

        The react-form Library

        If we take a look at the react-form library repo we can see that it’s used to:

        “Manage React forms tersely and safely-typed with no effort using React hooks. Build up your form logic by combining hooks yourself, or take advantage of the smart defaults provided by the powerful useForm hook.”

        The useForm and useField Hooks

        The documentation categorizes the API into three main sections: Hooks, Validation, and Utilities. There are eight hooks in total and for this tutorial we’ll focus our attention on just the ones most frequently used: useForm and useField.

        useForm is a custom hook for managing the state of an entire form and makes use of many of the other hooks in the API. Once instantiated, it returns an object with all of the fields you need to manage a form. When combined with useField, it allows you to easily build forms with smart defaults for common use cases. useField is a custom hook for handling the state and validations of an input field.

        The Starter CodeSandbox

        As this tutorial is meant to be a step-by-step guide providing all relevant code snippets along the way, we highly encourage you to fork this Starter CodeSandbox so you can code along throughout the tutorial.

        If you hit any roadblocks along the way, or just prefer to jump right into the solution code, here’s the Solution CodeSandbox.

        First Things First—Clean Up Old Code

        The useForm hook creates and manages the state of the entire form which means we no longer need to import or write a single line of useState in our component, nor do we need any of the previous handler functions used to update the input values. We still need to manage the onSubmit event as the form needs instructions as to where to send the captured input but the handler itself is imported from the useForm hook.

With that in mind, let’s remove all of the following previous state and handler logic from our form.
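For reference, the kind of Part 1 state and handlers being removed looks roughly like this (handleDescriptionChange is assumed; the exact Part 1 code may differ):

// A rough sketch of the state and handlers being removed; names are taken from the
// errors referenced below, and the bodies are approximations.
const [title, setTitle] = useState("");
const [description, setDescription] = useState("");

const handleTitleChange = (value) => setTitle(value);
const handleDescriptionChange = (value) => setDescription(value);

const handleSubmit = () => {
  console.log({ title, description });
};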

        React isn’t very happy at the moment and presents us with the following error regarding the handleTitleChange function not being defined.

        ReferenceError
        handleTitleChange is not defined

        This occurs because both TextField Components are still referencing their corresponding handler functions that no longer exist. For the time being, we’ll remove both onChange events along with the value prop for both Components.

        Although we’re removing them at the moment, they’re still required as per our form logic and will be replaced by the fields object provided by useForm.

        React still isn’t happy and presents us with the following error in regards to the Page Component assigning onAction to the handleSubmit function that’s been removed.

        ReferenceError
        handleSubmit is not defined

        It just so happens that the useForm hook provides a submit function that does the exact same thing, which we’ll destructure in the next section. For the time being we’ll assign submit to onAction and place it in quotes so that it doesn’t throw an error.

        One last and final act of cleanup is to remove the import for useState, at the top of the file, as we’ll no longer manage state directly.

        Our codebase now looks like the following:
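
        The original snippet isn't reproduced here, but after this cleanup the component should look roughly like the sketch below. It assumes the Polaris Page, FormLayout, and TextField setup from Part 1 and a component name of AddProduct, neither of which is spelled out in this part of the tutorial.

            import { Page, FormLayout, TextField } from "@shopify/polaris";

            function AddProduct() {
              return (
                // onAction is temporarily a string so React doesn't complain about the
                // missing handleSubmit function; we'll swap in useForm's submit shortly.
                <Page title="Add Product" primaryAction={{ content: "Save", onAction: "submit" }}>
                  <FormLayout>
                    {/* value and onChange are gone for now; the fields object will replace them */}
                    <TextField label="Title" />
                    <TextField label="Description" />
                  </FormLayout>
                </Page>
              );
            }

            export default AddProduct;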

        Importing and Using the useForm and useField Hooks

        Now that our Form has been cleaned up, let's go ahead and import both the useForm and useField hooks from react-form. Note that for this tutorial the @shopify/react-form package has already been installed as a dependency.

        import { useForm, useField } from "@shopify/react-form";

        If we take a look at the first example of useForm in the documentation, we can see that useForm provides quite a bit of functionality in a small package. This includes several properties and methods that can be destructured from its return value, in addition to accepting a configuration object that's used to define the form fields and an onSubmit function.

        In order to keep ourselves focused on the basic functionality, we'll start by capturing only the inputs for our two fields, title and description, and then handle the submission of the form. We'll pass in the configuration object, assigning useField() to each field, and lastly, an onSubmit function.
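
        Pieced together from that description, the hook setup looks something like the following sketch. The field names come from the tutorial; the onSubmit body is a placeholder that simply logs the values and reports success.

            const { fields, submit } = useForm({
              fields: {
                title: useField(""),
                description: useField(""),
              },
              onSubmit: async (fieldValues) => {
                // A real form would send fieldValues to a backend here.
                console.log("submitting:", fieldValues);
                return { status: "success" };
              },
            });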

        Since we previously removed the value and onChange props from our TextField components, the inputs no longer capture or display text. Those two props worked in conjunction: onChange updated state, and value displayed the captured input once the component re-rendered. The same functionality is still required, but those props are now found in the fields object, which we can easily confirm by adding a console.log and viewing the output:

        If we do a bit more investigation and expand the description key, we see all of its additional properties and methods, two of which are onChange and value.

        With that in mind, let’s add the following to our TextField components:
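
        The snippet in question amounts to spreading each key of the fields object into its matching TextField, which wires up value, onChange, and the field's other props in one go (a sketch, assuming the Polaris FormLayout wrapper from earlier):

            <FormLayout>
              <TextField label="Title" {...fields.title} />
              <TextField label="Description" {...fields.description} />
            </FormLayout>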

        In the code we just added, we're spreading each key of the fields object into the TextField whose label it corresponds to. We should also be able to type into the inputs and see the text update. useForm also provides additional properties and methods, such as reset and dirty, that we'll make use of later when we connect our submit function.

        Submitting the Form

        With our TextField components all set up, it's time to enable the form to be submitted. As part of the earlier cleanup, we updated the Page component's onAction prop, and now it's time to remove the quotes.

        Now that we’ve enabled submission of the form, let’s confirm that the onSubmit function works and take a peek at the fields object by adding a console log.

        Let’s add a title and description to our new product and click Save.

        Adding a Product Screen with a Title and Description fields
        Adding A Product

        We see the following output:

        More Than Just Submitting

        When we reviewed the useForm documentation earlier we made note of all the additional functionality that it provides, two of which we will make use of now: reset and dirty.

        Reset the Form After Submission

        reset is a method used to clear the form, providing the user with a clean slate to add another product once the previous one has been saved. reset should be called only after the fields have been passed to the backend and the data has been handled appropriately, but before the onSubmit handler returns.
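
        A sketch of what that looks like inside the onSubmit handler (the comment marks where the backend call would go):

            const { fields, submit, reset } = useForm({
              fields: {
                title: useField(""),
                description: useField(""),
              },
              onSubmit: async (fieldValues) => {
                // Pass fieldValues to the backend and handle the response here...
                reset(); // ...then clear the form for the next product.
                return { status: "success" };
              },
            });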

        If you input some text and click Save, our form should clear the input fields as expected.

        Conditionally Enable The Save Button

        dirty is used to disable the Save button until the user has typed some text into either of the input fields. The Page component manages the Save button and has a disabled property, to which we assign !dirty: dirty starts out as false, so the button begins disabled and only becomes enabled once a field has been edited.
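
        A sketch of the Page component with dirty wired into its primary action (the page title here is an assumption):

            // dirty comes from the same useForm destructure shown earlier:
            // const { fields, submit, reset, dirty } = useForm({ ... });

            <Page
              title="Add Product"
              primaryAction={{
                content: "Save",
                onAction: submit,
                disabled: !dirty, // Save stays disabled until a field is edited
              }}
            >
              {/* ...form markup from earlier... */}
            </Page>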

        You should now notice that the Save button is disabled until you type into either of the fields, at which point it becomes enabled.

        We can also validate that it’s now disabled by examining the Save button in developer tools.

        Developer Tools screenshot showing the code disabling the save button.
        Save Button Disabled

        Form Validation

        What we might have noticed when adding dirty is that the Save button becomes enabled as soon as the user types into either field. One last aspect of our form is that we'll require the Title field to contain some input before the product can be submitted. To do this we'll import the notEmpty validator from react-form.

        Using it also requires that we now pass useField a configuration object with two keys: value and validates. The value key keeps track of the current input value, and validates gives us a means of validating the input based on some criteria.

        In our case, we’ll prevent the form from being submitted if the title field is empty and provide the user an error message indicating that it’s a required field.
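
        Putting the validation in place looks roughly like this (the error string matches the message shown in the screenshot below):

            import { useForm, useField, notEmpty } from "@shopify/react-form";

            const { fields, submit, reset, dirty } = useForm({
              fields: {
                title: useField({
                  value: "",
                  validates: [notEmpty("field is required")],
                }),
                description: useField(""),
              },
              onSubmit: async (fieldValues) => {
                // Send fieldValues to the backend, then clear the form.
                reset();
                return { status: "success" };
              },
            });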

        Let’s give it a test run and confirm it’s working as expected. Try adding only a description and then click Save.

        Add Product screen with Title and Description fields. Title field is empty and showing error message “field is required.”
        Field is required error message

        As we can see our form implements all the previous functionality and then some, all of which was done via the useForm and useField hooks. There’s quite a bit more functionality that these specific hooks provide, so I encourage you to take a deeper dive and explore them further.

        This tutorial was meant to introduce you to Shopify's open source react-form library, available in Shopify's public Quilt repo. The repo provides many more useful React packages, such as:

        • react-graphql: for creating type-safe and asynchronous GraphQL components for React
        • react-testing: for testing React components according to our conventions
        • react-i18n: i18n utilities for handling translations and formatting

        Joe Keohan is a Front End Developer at Shopify, located in Staten Island, NY and working on the Self Serve Hub team enhancing the buyer’s experience. He’s been instructing software engineering bootcamps for the past 6 years and enjoys teaching others about software development. When he’s not working through an algorithm, you’ll find him jogging, surfing and spending quality time with his family.


        Wherever you are, your next journey starts here! If building systems from the ground up to solve real-world problems interests you, our Engineering blog has stories about other challenges we have encountered. Intrigued? Visit our Engineering career page to find out about our open positions and learn about Digital by Design.

        Continue reading

        Leveraging Go Worker Pools to Scale Server-side Data Sharing

        Leveraging Go Worker Pools to Scale Server-side Data Sharing

        Maintaining a service that processes over a billion events per day poses many learning opportunities. In an ecommerce landscape, our service needs to be ready to handle global traffic, flash sales, holiday seasons, failovers and the like. Within this blog, I'll detail how I scaled our Server Pixels service to increase its event processing performance by 170%, from 7.75 thousand events per second per pod to 21 thousand events per second per pod.

        But First, What is a Server Pixel?

        Capturing a Shopify customer's journey is critical to contributing insights into the marketing efforts of Shopify merchants. Similar to Web Pixels, Server Pixels gives Shopify merchants the freedom to activate and understand their customers' behavioral data by sending this structured data from storefronts to marketing partners. However, unlike the Web Pixel service, Server Pixels sends these events through the server rather than the client. This server-side data sharing has proven to be more reliable, allowing for better control and observability of outgoing data to our partners' servers. Merchants benefit from this because they're able to drive more sales at a lower cost of acquisition (CAC). With regional opt-out privacy regulations built into our service, only customers who have allowed tracking will have their events processed and sent to partners. Key events in a customer's journey on a storefront are captured, such as checkout completion, search submissions, and product views. Server Pixels is a service written in Golang which validates, processes, augments, and produces more than one billion customer events per day. However, with the management of such a large number of events, problems of scale start to emerge.

        The Problem

        Server Pixels leverages Kafka infrastructure to manage the consumption and production of events. We began to have a problem with scale when an increase in customer events triggered an increase in consumption lag on our Kafka input topic. Our service was susceptible to falling behind on events if any downstream components slowed down. Shown in the diagram below, our downstream components process (parse, validate, and augment) and produce events in batches:

        Flow diagram showing the flow of information from Storefronts to Kafka Consumer to the Batch Processor to the Unlimited Go Routines to the Kafka Producer and finally to Third Party Partners.
        Old design

        The problem with our original design was that an unlimited number of threads would get spawned when batch events needed to be processed or produced. So when our service received an increase in events, an unsustainable number of goroutines were generated and ran concurrently.

        Goroutines can be thought of as lightweight threads: functions or methods that run concurrently with other functions and threads. In a service, spawning an unlimited number of goroutines to execute an ever-growing queue of tasks is never ideal. The machine executing these tasks will continue to expend its resources, like CPU and memory, until it reaches its limit. Furthermore, our team has a service level objective (SLO) of five minutes for event processing, so any delays in processing would push us past that deadline. In anticipation of three times the usual load for BFCM, we needed a way for our service to work smarter, not harder.

        Our solution? Go worker pools.

        The Solution

        A flow diagram showing the Go worker pool pattern
        Go worker pool pattern

        The worker pool pattern is a design in which a fixed number of workers are given a stream of tasks to process in a queue. The tasks stay in the queue until a worker is free to pick up the task and execute it. Worker pools are great for controlling the concurrent execution for a set of defined jobs. As a result of these workers controlling the amount of concurrent goroutines in action, less stress is put on our system’s resources. This design also worked perfectly for scaling up in anticipation of BFCM without relying entirely on vertical or horizontal scaling.

        When tasked with this new design, I was surprised at the intuitive setup for worker pools. The premise was creating a Go channel that receives a stream of jobs. You can think of Go channels as pipes that connect concurrent goroutines together, allowing them to communicate with each other. You send values into channels from one goroutine and receive those values into another goroutine. The Go workers retrieve their jobs from a channel as they become available, given the worker isn’t busy processing another job. Concurrently, the results of these jobs are sent to another Go channel or to another part of the pipeline.
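
        As a tiny, self-contained illustration of that hand-off (not Server Pixels code), one goroutine can send values into a channel while another receives them:

            package main

            import "fmt"

            func main() {
                jobs := make(chan string)

                // One goroutine sends values into the channel...
                go func() {
                    jobs <- "batch-1"
                    jobs <- "batch-2"
                    close(jobs)
                }()

                // ...and the main goroutine receives them as they become available.
                for job := range jobs {
                    fmt.Println("processing", job)
                }
            }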

        So let me take you through the logistics of the code!

        The Code

        I defined a worker interface that requires a CompleteJobs function that requires a go channel of type Job.

        The Job type takes the event batch, that’s integral to completing the task, as a parameter. Other types, like NewProcessorJob, can inherit from this struct to fit different use cases of the specific task.

        New workers are created using the function NewWorker. It takes workFunc as a parameter, which processes the jobs. This workFunc can be tailored to any use case, which is the core of what makes the design powerful: the same Worker interface is used amongst different components to do different types of tasks based on the Job spec.

        CompleteJobs will call workFunc on each Job as it receives it from the jobs channel. 
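
        The original listings aren't reproduced here, but a sketch of what the last few paragraphs describe might look like the following. The names Worker, Job, NewWorker, CompleteJobs, and workFunc come from the post; the exact fields and signatures are assumptions.

            package pixels

            import "sync"

            // Event is a placeholder for a single customer event in a batch.
            type Event struct{}

            // Job carries the event batch a task operates on. Specialized job types
            // (like the post's NewProcessorJob) can embed Job to add their own fields.
            type Job struct {
                Events []Event
            }

            // Worker completes jobs received from a channel until that channel is closed.
            type Worker interface {
                CompleteJobs(jobs <-chan Job, wg *sync.WaitGroup)
            }

            type worker struct {
                workFunc func(Job)
            }

            // NewWorker returns a Worker that runs workFunc on every job it receives.
            // Because workFunc can be tailored to any use case, the same interface
            // serves the processor, the producer, and other components.
            func NewWorker(workFunc func(Job)) Worker {
                return &worker{workFunc: workFunc}
            }

            // CompleteJobs calls workFunc on each Job as it arrives on the jobs channel
            // and signals the wait group once the channel is closed and drained.
            func (w *worker) CompleteJobs(jobs <-chan Job, wg *sync.WaitGroup) {
                defer wg.Done()
                for job := range jobs {
                    w.workFunc(job)
                }
            }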

        Now let’s tie it all together.

        Above is an example of how I used workers to process our events in our pipeline. A job channel and a set number of numWorkers workers are initialized. The workers are then posed to receive from the jobs channel in the CompleteJobs function in a goroutine. Putting go before the CompleteJobs function allows the function to run in a goroutine!

        As event batches get consumed in the for loop above, each batch is converted into a Job that's emitted to the jobs channel with the <- operator. Each worker receives these jobs from the jobs channel. The goroutine we previously started with go worker.CompleteJobs(jobs, &producerWg) runs concurrently and receives these jobs.

        But wait, how do the workers know when to stop processing events?

        When the system is ready to be scaled down, wait groups are used to ensure that any tasks still in flight are completed before the system shuts down. A WaitGroup is a kind of counter in Go that blocks the execution of a function until its internal counter becomes zero. As each worker was created above, the WaitGroup counter was incremented with producerWg.Add(1). In the CompleteJobs function, wg.Done() is executed when the jobs channel is closed and jobs stop being received; wg.Done() decrements the WaitGroup counter for every worker.

        When a context cancel signal is received (signified by <-ctx.Done() above), the remaining batches are sent to the Job channel so the workers can finish their execution. The Job channel is then closed safely, enabling the workers to break out of the loop in CompleteJobs and stop processing jobs. At this point, the WaitGroup counters are zero and the outputBatches channel, where the results of the jobs get sent, can be closed safely.
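
        Pulling the last few paragraphs together, a minimal sketch of the wiring might look like the following. Channel and variable names such as jobs, outputBatches, numWorkers, and producerWg come from the prose; the Kafka consumer and producer plumbing is reduced to stubs, so this illustrates the pattern rather than the actual Server Pixels code. It continues the sketch above (with context added to the imports).

            func runPool(ctx context.Context, consumedBatches <-chan []Event, numWorkers int) {
                jobs := make(chan Job)
                outputBatches := make(chan []Event)

                // Stand-in for the Kafka producer side, which consumes the workers' results.
                go func() {
                    for range outputBatches {
                    }
                }()

                var producerWg sync.WaitGroup
                for i := 0; i < numWorkers; i++ {
                    producerWg.Add(1)
                    w := NewWorker(func(job Job) {
                        // workFunc: parse, validate, and augment the batch, then emit it downstream.
                        outputBatches <- job.Events
                    })
                    go w.CompleteJobs(jobs, &producerWg)
                }

                for {
                    select {
                    case <-ctx.Done():
                        // The real service also flushes any remaining consumed batches here.
                        close(jobs)          // workers fall out of CompleteJobs once drained
                        producerWg.Wait()    // every in-flight job has finished
                        close(outputBatches) // now safe to close the results channel
                        return
                    case events := <-consumedBatches:
                        jobs <- Job{Events: events} // hand the batch to the worker pool
                    }
                }
            }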

        The Improvements

        A flow diagram showing the new design flow from Storefronts to Kafka consumer to batch processor to Go worker pools to Kafka producer to Third party partners.
        New design

        Once deployed, the time improvement using the new worker pool design was promising. I conducted load testing that showed as more workers were added, more events could be processed on one pod. As mentioned before, in our previous implementation our service could only handle around 7.75 thousand events per second per pod in production without adding to our consumption lag.

        My team initially set the number of workers to 15 each in the processor and producer. This introduced a processing lift of 66% (12.9 thousand events per second per pod). By upping the workers to 50, we increased our event load by 149% from the old design, resulting in 19.3 thousand events per second per pod. Currently, with performance improvements we can do 21 thousand events per second per pod. A 170% increase! This was a great win for the team and gave us the foundation to be adequately prepared for BFCM 2021, where we experienced a max of 46 thousand events per second!

        Go worker pools are a lightweight solution to speed up computation and allow concurrent tasks to be more performant. This Go worker pool design has been reused to work with other components of our service such as validation, parsing, and augmentation.

        By using the same Worker interface for different components, we can scale out each part of our pipeline differently to meet its use case and expected load. 

        Cheers to more quiet BFCMs!

        Kyra Stephen is a backend software developer on the Event Streaming - Customer Behavior team at Shopify. She is passionate about working on technology that impacts users and makes their lives easier. She also loves playing tennis, obsessing about music and being in the sun as much as possible.


        Wherever you are, your next journey starts here! If building systems from the ground up to solve real-world problems interests you, our Engineering blog has stories about other challenges we have encountered. Intrigued? Visit our Engineering career page to find out about our open positions and learn about Digital by Design.

        Continue reading

        Shopify Data’s Guide To Opportunity Sizing

        Shopify Data’s Guide To Opportunity Sizing

        For every initiative that a business takes on, there is an opportunity potential and a cost—the cost of not doing something else. But how do you tangibly determine the size of an opportunity?

        Opportunity sizing is a method that data scientists can use to quantify the potential impact of an initiative ahead of making the decision to invest in it. Although businesses attempt to prioritize initiatives, they rarely do the math to assess the opportunity, relying instead on intuition-driven decision making. While this type of decision making does have its place in business, it also runs the risk of being easily swayed by a number of subtle biases, such as the information most readily available, confirmation bias, or our intrinsic desire to pattern-match a new decision to our prior experience.

        At Shopify, our data scientists use opportunity sizing to help our product and business leaders make sure that we’re investing our efforts in the most impactful initiatives. This method enables us to be intentional when checking and discussing the assumptions we have about where we can invest our efforts.

        Here’s how we think about opportunity sizing.

        How to Opportunity Size

        Opportunity sizing is more than just a tool for numerical reasoning, it’s a framework businesses can use to have a principled conversation about the impact of their efforts.

        An example of opportunity sizing could look like the following equation: if we build feature X, we will acquire MM (+/- delta) new active users in T timeframe under DD assumptions.

        So how do we calculate this equation? Well, first things first: although the timeframe for opportunity sizing an initiative can be anything relevant to your initiative, we recommend an annualized view of the impact so you can easily compare across initiatives. This matters because the point in the year when your initiative goes live can significantly change its estimated in-year impact.

        Diving deeper into how to size an opportunity, below are a few methods we recommend for various scenarios.

        Directional T-Shirt Sizing

        Directional t-shirt sizing is the most common approach when opportunity sizing an existing initiative and is a method anyone (not just data scientists) can do with a bit of data to inform their intuition. This method is based on rough estimates and depends on subject matter experts to help estimate the opportunity size based on similar experiences they’ve observed in the past and numbers derived from industry standards. The estimates used in this method rely on knowing your product or service and your domain (for example, marketing, fulfillment, etc.). Usually the assumptions are generalized, assuming overall conversion rates using averages or medians, and not specific to the initiative at hand.

        For example, let’s say your Growth Marketing team is trying to update an email sequence (an email to your users about a new product or feature) and is looking to assess the size of the opportunity. Using the directional t-shirt sizing method, you can use the following data to inform your equation:

        1. The open rates of your top-performing content
        2. The industry average of open rates 

        Say your top-performing content has an open rate of five percent, while the industry average is ten percent. Based on these benchmarks, you can assume that the opportunity can be doubled (from five to ten percent).

        This method offers speed over accuracy, so there's a risk of embedded biases and a lack of thorough reflection on the assumptions made. Directional t-shirt sizing should only be used for early ideation or sanity checking. Opportunity sizing for growth initiatives should use the next method: bottom-up.

        A matrix diagram with high rigor, lower rigor, new initiatives and existing initiatives as categories. It highlights that Directional T-Shirt sizing requires lower rigor opportunities
        Directional t-shirt sizing should be used for existing initiatives that require lower rigor opportunities.

        Bottom-Up Using Comparables

        Unlike directional t-shirt sizing, the bottom-up method uses the performance of a specific comparable product or system as a benchmark, and relies on the specific skills of a data scientist to make data-informed decisions. It's used to determine the opportunity of an existing initiative, and because it relies on observed data from similar systems, it tends to be more accurate than directional t-shirt sizing. Here are some tips for using the bottom-up method:

        1. Understand the performance of a product or system that is comparable.

        To introduce any enhancements to your current product or system, you need to understand how it’s performing in the first place. You’ll want to identify, observe and understand the performance rates in a comparable product or system, including the specifics of its unique audience and process.

        For example, let’s say your Growth Marketing team wants to localize a new welcome email to prospective users in Italy that will go out to 100,000 new leads per year. A comparable system could be a localized welcome email in France that the team sent out the prior year. With your comparable system identified, you’ll want to dig into some key questions and performance metrics like:

        • How many people received the email?
        • Is there anything unique about that audience selection? 
        • What is the participation rate of the email? 
        • What is the conversion rate of the sequence? Or in other words, of those that opened your welcome email, how many converted to customers?

        Let’s say we identified that our current non-localized email in Italy has a click through rate (CTR) of three percent, while our localized email in France has a CTR of five percent over one year. Based on the metrics of your comparable system, you can identify a base metric and make assumptions of how your new initiative will perform.

        2. Be clear and document your assumptions.

        As you think about your initiative, be clear and document your assumptions about its potential impact and the why behind each assumption. Using the performance metrics of your comparable system, you can generate an assumed base metric and the potential impact your initiative will have on that metric. With your base metric in hand, you’ll want to consider the positive and negative impacts your initiative may have, so quantify your estimate in ranges with an upper and lower bound.

        Returning to our localized welcome email example, based on the CTR metrics from our comparable system we can assume the impact of our Italy localization initiative: if we send out a localized welcome email to 100,000 new leads in Italy, we will obtain a CTR between three and five percent (+/- delta) in one year. This is based on our assumptions that localized content will perform better than non-localized content, as seen in the performance metrics of our localized welcome email in France.

        3. Identify the impact of your initiative on your business’ wider goals.

        Now that you have your opportunity sizing estimate for your initiative, the next question that comes to mind is: what does that mean for the rest of your business goals? To answer this, you'll want to estimate the impact on your top-line metric. This enables you to compare different initiatives with an apples-to-apples lens, while also avoiding the tendency to be biased toward larger numbers when making comparisons and assessing impact. For example, a one percent change in the number of sessions can look much bigger than a three percent change in the number of customers, which is further down the funnel.

        Returning to our localized welcome email example, we should ask ourselves how an increase in CTR impacts our top-line metric of active user count. Let's say that when we localized the email in France, we saw an increase of five percent in CTR that translated to a three percent increase in active users per year. Accordingly, if we localize the welcome email in Italy, we may expect to get a three percent increase, which would translate to 3,000 more active users per year.
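
        Laid out as back-of-the-envelope arithmetic, the example works out as follows. Mapping the assumed three percent active-user lift onto the 100,000-lead cohort is an assumption about this particular example, not a statement of a general sizing model.

            // Numbers taken from the worked example in the text.
            const newLeadsPerYear = 100_000;

            // Observed comparable (France): the localized email's higher CTR translated
            // into roughly a three percent lift in active users per year.
            const assumedActiveUserLift = 0.03;

            const additionalActiveUsers = newLeadsPerYear * assumedActiveUserLift;
            console.log(additionalActiveUsers); // ≈ 3,000 more active users per year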

        Second order thinking is a great asset here. It’s beneficial for you to consider potential modifiers and their impact. For instance, perhaps getting more people to click on our welcome email will reduce our funnel performance because we have lower intent people clicking through. Or perhaps it will improve funnel performance because people are better oriented to the offer. What are the ranges of potential impact? What evidence do we have to support these ranges? From this thinking, our proposal may change: we may not be able to just change our welcome email, we may also have to change landing pages, audience selection, or other upstream or downstream aspects.

        A matrix diagram with high rigor, lower rigor, new initiatives and existing initiatives as categories. It highlights that Bottom-up sizing is used for existing initiatives that requires higher rigor opportunities
        Bottom-up opportunity sizing should be used for existing initiatives that require higher rigor opportunities.

        Top-Down

        The top-down method should be used when opportunity sizing a new initiative. This method is more nuanced as you're not optimizing something that already exists. With the top-down method, you'll start with a larger set of vague information, which you'll then attempt to narrow down into a more accurate estimate based on assumptions and observations.

        Here are a few tips on how to implement the top-down method:

        1. Gather information about your new initiative.

        Unlike the bottom-up method, you won’t have a comparable system to establish a base metric. Instead, seek as much information on your new initiative as you can from internal or external sources.

        For example, let’s say you’re looking to size the opportunity of expanding your product or service to a new market. In this case, you might want to get help from your product research team to gain more information on the size of the market, number of potential users in that market, competitors, etc.

        2. Be clear and document your assumptions.

        Just like the bottom-up method, you'll want to clearly identify your estimates and what evidence you have to support them. For new initiatives, assumptions typically lean more optimistic than they do for existing initiatives, because we're biased to believe that our initiatives will have a positive impact. This means you need to be rigorous in testing your assumptions as part of the sizing process. Some ways to test your assumptions include:

        • Using the range of improvement of previous initiative launches to give you a sense of what's possible. 
        • Bringing the business case to senior stakeholders and arguing your case. Often this makes you have to think twice about your assumptions.

        You should be conservative in your initial estimates to account for this lack of precision in your understanding of the potential.

        Looking at our example of opportunity sizing a new market, we’ll want to document some assumptions about:

        • The size of the market: What is the size of the existing market versus the new market? You can gather this information from external datasets. In the absence of data on a market or audience, you can also make assumptions based on similar audiences or regions elsewhere.
        • The rate at which you think you can reach and engage this market: This includes the assumed conversion rates of new users. The conversion rates may be assumed to be similar to past performance when a new channel or audience was introduced. You can use the tips identified in the bottom-up method.

        3. Identify the impact of your initiative on your business’ wider goals.

        Like the bottom-up method, you need to assess the impact your initiative will have on your business’ wider goals. Based on the above example, what does the assumed impact of our initiative mean in terms of active users?

        A matrix diagram with high rigor, lower rigor, new initiatives and existing initiatives as categories. It highlights that Top-down sizing is for new initiatives that require higher rigor opportunities
        Top-down opportunity sizing should be used for new initiatives that require higher rigor opportunities.

        And there you have it! Opportunity sizing is a worthy investment that helps you say yes to the most impactful initiatives. It’s also a significant way for data science teams to help business leaders prioritize and steer decision-making. Once your initiative launches, test to see how close your sizing estimates were to actuals. This will help you hone your estimates over time.

        Next time your business is outlining its product roadmap, or your team is trying to decide whether it’s worth it to take on a particular project, use our opportunity sizing basics to help identify the potential opportunity (or lack thereof).

        Dr. Hilal is the VP of Data Science at Shopify, responsible for overseeing the data operations that power the company’s commercial and service lines.


        Wherever you are, your next journey starts here! If building systems from the ground up to solve real-world problems interests you, our Engineering blog has stories about other challenges we have encountered. Intrigued? Visit our Data Science & Engineering career page to find out about our open positions. Learn about how we’re hiring to design the future together—a future that is Digital by Design.

        Continue reading

        How We Built Oxygen: Hydrogen’s Counterpart for Hosting Custom Storefronts

        How We Built Oxygen: Hydrogen’s Counterpart for Hosting Custom Storefronts

        In June, we shared the details of how we built Hydrogen, our React-based framework for building custom storefronts. We talked about some of the big bets we made on new technologies like React Server components, and the many internal and external collaborations that made Hydrogen a reality.  

        This time we tell the story of Oxygen, Hydrogen’s counterpart that makes hosting Hydrogen custom storefronts easy and seamless. Oxygen guarantees fast and globally available storefronts that securely integrate with the Shopify ecosystem while eliminating additional costs of setting up third-party hosting tools. 

        We’ll dive into the experiences we focused on, the technical choices we made to build those experiences, and how those choices paved the path for Shopify to get involved in standardizing serverless runtimes in partnership with leaders like Cloudflare and Deno.

        Shopify-Integrated Merchant Experience

        Let’s first briefly look at why we built Oxygen. There are existing products in the market that can host Hydrogen custom storefronts. Oxygen’s uniqueness is in the tight integration it provides with Shopify. Our technical choices so far have largely been grounded in ensuring this integration is frictionless for the user.

        We started with GitHub for version control, GitHub actions for continuous deployment, and Cloudflare for worker runtimes and edge distribution. We combined these third-party services with first-party services such as Shopify CDN, Shopify Admin API, and Shopify Identity and Access Management. They’re glued together by Oxygen-scoped services that additionally provide developer tooling and observability. Oxygen today is the result of bundling together this collection of technologies.

        A flow diagram highlighting the interaction between the developer, GitHub, Oxygen, Cloudflare, the buyer, and Shopify
        Oxygen overview

        We introduced the Hydrogen sales channel as the connector between Hydrogen, Oxygen, and the shop admin. The Hydrogen channel is the portal that provides controls to create and manage custom storefronts, link them to the rest of shop administrative functions, and connect them to Oxygen for hosting. It is built on Shopify’s standard Rails and React stack, leveraging Polaris design system for consistent user experience across Shopify-built admin experiences.

        Fast Buyer Experience

        Oxygen exists to give merchants the confidence that Shopify will deliver an optimal buyer experience while merchants focus on their entrepreneurial objectives. Optimal buyer experience in Oxygen’s context is a combination of high availability guarantees, super fast site performance from anywhere in the world, and resilience to handle high-volume traffic.

        To Build or To Reuse

        This is where we had the largest opportunity to contemplate our technical direction. We could have leveraged over a decade of Shopify's experience building infrastructure solutions that keep the entire platform up and running to build our own infrastructure layer, control plane, and proprietary V8 isolates. In fact, we briefly did, and it was a successful venture! However, we ultimately decided to opt for Cloudflare's battle-hardened worker infrastructure, whose global edge distribution guarantees buyers access to storefronts within milliseconds.

        This foundational decision significantly simplified upfront infrastructural complexity, scale and security risk considerations, allowing us to get Oxygen to merchants faster and validate our bets. These choices also leave us enough room to go back and build our own proprietary version at scale or a simpler variation of it if it makes sense both for the business and the users.

        A flow diagram highlighting the interactions between the Shopify store, Cloudflare, and the Oxygen workers.
        Oxygen workers overview

        We were able to provide merchants and buyers the promised fast performance while locking in access controls. When a buyer makes a request to a Hydrogen custom storefront hosted at myshop.com, that request is received by Oxygen's Gateway Worker running in Cloudflare. This worker is responsible for validating that the accessor has the necessary authorization to the shop and the specific storefront version before routing the request to the Storefront Worker that runs the Hydrogen-based storefront code. The worker chaining is made possible by Cloudflare's new Dynamic Dispatch API from Workers for Platforms.
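
        For illustration only, a gateway-style worker that chains to another worker with the Dynamic Dispatch API looks roughly like this. The binding name, the storefront worker name, and the authorization check are assumptions, not Oxygen's actual code.

            // Minimal local typing for the sketch; real projects would use @cloudflare/workers-types.
            interface DispatchNamespace {
              get(name: string): { fetch(request: Request): Promise<Response> };
            }

            // Hypothetical authorization check standing in for Oxygen's real validation.
            async function isAuthorized(request: Request): Promise<boolean> {
              return true;
            }

            export default {
              async fetch(request: Request, env: { DISPATCHER: DispatchNamespace }) {
                if (!(await isAuthorized(request))) {
                  return new Response("Unauthorized", { status: 401 });
                }
                // Route the request to the worker running the Hydrogen-based storefront code.
                const storefront = env.DISPATCHER.get("storefront-worker");
                return storefront.fetch(request);
              },
            };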

        Partnerships and Open Source

        Rather than reinventing the wheel, we took the opportunity to work with leaders in the JavaScript runtimes space to collectively evolve the ecosystem. We use and contribute to the Workers for Platforms solution through tight feedback loops with Cloudflare. We also jointly established WinterCG, a JavaScript runtimes community group in partnership with Cloudflare, Vercel, Deno, and others. We leaned in to collectively building with and for the community, just the way we like it at Shopify.

        Familiar Developer Experience

        Oxygen largely provides developer-oriented capabilities and we strive to provide a developer experience that cohesively integrates with existing developer workflows. 

        Continuous Data-Informed Improvements

        While the Oxygen platform takes care of the infrastructure and distribution management of custom storefronts, it also surfaces critical information about how those storefronts are performing in production, ensuring fast feedback loops throughout the development lifecycle. Specifically, runtime logs and metrics are surfaced through the observability dashboard within the Hydrogen channel for troubleshooting and performance trend monitoring. Developers can then work out what actions are needed to further improve site quality.

        A flow diagram highlighting the observability enablement side
        Oxygen observability overview

        We made very deliberate technical choices on the observability side as well. Unlike Shopify's internal observability stack, Oxygen's consists of Grafana for dashboards and alerting, Cortex for metrics, Loki for logging, and Tempo for tracing. Under the hood, Oxygen's Trace Worker runs in Cloudflare and attaches itself to Storefront Workers to capture all of the logging and metrics information, forwarding it to our Grafana stack. Logs are sent to Loki and metrics are sent to Cortex, where the Oxygen Observability Service pulls both on demand when the Hydrogen channel requests them.

        The tech stack was chosen for two key reasons: Oxygen provided a good test bed to experiment with and evaluate these tools for a potential long-term fit for the rest of Shopify, and Oxygen's use case is fundamentally different from internal Shopify's. To support the latter, we needed a way to cleanly separate internal-facing from external-facing metrics while scaling to the data loads. We also needed the tooling to be flexible enough to give merchants the option of integrating with any existing monitoring tools in their workflows.

        What’s Next

        Thanks to many flexible, eager, and collaborative merchants who maintained tight feedback loops every step of the way, Oxygen is used in production today by Allbirds, Shopify Supply, Shopify Hardware, and Denim Tears. It is generally available to our Plus merchants as of June.

        We’re just getting started though! We have our eyes on unlocking composable, plug-and-play styled usage in addition to surfacing deeper development insights earlier in the development lifecycle to shorten feedback loops. We also know there is a lot of opportunity for us to enhance the developer experience by reducing the number of surfaces to interact with, providing more control from the command line, and generally streamlining the Oxygen developer tools with the overall Shopify developer toolbox.

        We’re eager to take in all the merchant feedback as they demand the best of us. It helps us discover, learn, and push ourselves to revalidate assumptions, which will ultimately create new opportunities for the platform to evolve.

        Curious to learn more? We encourage you to check out the docs!

        Sneha is an engineering leader on the Oxygen team and has contributed to various teams building tools for a developer-oriented audience.


        Wherever you are, your next journey starts here! If building systems from the ground up to solve real-world problems interests you, our Engineering blog has stories about other challenges we have encountered. Intrigued? Visit our Engineering career page to find out about our open positions and learn about Digital by Design.

        Continue reading

        How We Enable Two-Day Delivery in the Shopify Fulfillment Network

        How We Enable Two-Day Delivery in the Shopify Fulfillment Network

        Merchants sign up for Shopify Fulfillment Network (SFN) to save time on storing, packing, and shipping their merchandise, which is distributed and balanced across our network of Shopify-certified warehouses throughout North America. We optimize the placement of inventory so it’s closer to buyers, saving on shipping time, costs, and carbon emissions.

        Recently, SFN also made it easier and more affordable for merchants to offer two-day delivery across the United States. We consider many real-world factors to provide accurate delivery dates, optimizing for sustainable ground shipment methods and low cost.

        A Platform Solution

        As with most new features at Shopify, we couldn’t just build a custom solution for SFN. Shopify is an extensible platform with a rich ecosystem of apps that solve complex merchant problems. Shopify Fulfillment Network is just one of the many third-party logistics (3PL) solutions available to merchants. 

        This was a multi-team initiative. One team built the delivery date platform in the core of Shopify, consisting of a new set of GraphQL APIs that any 3PL can use to upload their delivery dates to the Shopify platform. Another team integrated the delivery dates into the Shopify storefront, where they are shown to buyers on the product details page and at checkout.  A third team built the system to calculate and upload SFN delivery dates to the core platform. SFN is an app that merchants install in their shops, in the same way other 3PLs interact with the Shopify platform. The SFN app calls the new delivery date APIs to upload its own delivery dates to Shopify. For accuracy, the SFN delivery dates are calculated using network, carrier and product data. Let’s take a closer look at these inputs to the delivery date.

        Four Factors That Determine Delivery Date

        There are many factors that determine when a package will leave the warehouse and how long it will spend in transit to the destination address. Each 3PL has its own particular considerations, and the Shopify platform is flexible enough to support them all. Requiring 3PLs to conform to fine-grained platform primitives, such as operating days and capacity or processing and transit times, would only result in a loss of fidelity in representing their particular networks and processes.

        With that in mind, we let 3PLs populate the platform with their pre-computed delivery dates that Shopify surfaces to buyers on the product details page and checkout. The 3PL has full control over all the factors that affect their delivery dates. Let’s take a look at some of these factors.

        1. Proximity to the Destination

        The time required for delivery is dependent on the distance the package must travel to arrive at its destination. Usually, the closer the inventory is to the destination, the faster the delivery. This means that SFN delivery dates depend on specific inventory availability throughout the network. 

        2. Heavy, Bulky, or Dangerous Goods

        Some carrier services aren’t applicable to merchandise that exceeds a specific weight or dimensions. Others can’t be used to transport hazardous materials. The delivery date we predict for such items must be based on a shipping carrier service that can meet the requirements.

        3. Time Spent at Fulfillment Centers

        Statutory holidays and warehouse closures affect when the package can be ready for shipment. Fulfillment centers have regular operating calendars and sometimes exceptional circumstances can force them to close, such as severe weather events.

        On any given operating day, warehouse staff need time to pick and pack items into packages for shipment. There’s also a physical limit to how many packages can be processed in a day. Again, exceptional circumstances such as illness can reduce the staff available at the fulfillment center, reducing capacity and increasing processing times.

        4. Time Spent in the Hands of Shipping Carriers

        In SFN, shipping carriers such as UPS and USPS are used to transport packages from the warehouse to their destination. Just like the warehouses, shipping carriers have their own holidays and closures that affect when the package can be picked up, transported, and delivered. These are modeled as carrier transit days, when packages are moved between hubs in the carrier’s own network, and delivery days, when packages are taken from the last hub to the final delivery address.

        Shipping carriers send trucks to pick up packages from the warehouse at scheduled times of the day. Orders that are made after the last pickup for the day have to wait until the next day to be shipped out. Some shipping carriers only pick up from the warehouse if there are enough packages to make it worth their while. Others impose a limit on the volume of packages they can take away from the warehouse, according to their truck dimensions. These capacity limits influence our choice of carrier for the package.

        Shipping carriers also publish the expected number of days it takes for them to transport a package from the warehouse to its destination (called Time in Transit). The first transit day is the day after pickup, and the last transit day is the delivery day. Some carriers deliver on Saturdays even though they won’t transport the package within their network on a Saturday.

        Putting It All Together

        Together, all of these factors are considered when we decide which fulfillment center should process a shipment and which shipping carrier service to use to transport it to its final destination. At regular intervals, we select a fulfillment center and carrier service that optimizes for sustainable two-day delivery for every merchant SKU to every US zip code. From this, we upload pre-calculated schedules of delivery dates to the Shopify platform.

        That’s a Lot of Data

        It’s a lot of data, but much of it can be shared between merchant SKUs. Our strategy is to produce multiple schedules, each one reflecting the delivery dates for inventory available at the same set of fulfillment centers and with similar characteristics such as weight and dimensions. Each SKU is mapped to a schedule, and SKUs from different shops can share the same schedule.

        Example mapping of SKUs to schedules of delivery dates

        In this example, SKU-1.2 from Shop 1 and SKU-3.1 from Shop 3 share the same schedule of delivery dates for heavy items stocked in both California and New York. If an order is placed today by 4pm EST for SKU-1.2 or SKU-3.1, shipping to zip code 10019 in New York, it will arrive on June 10. Likewise, if an order is placed today for SKU-1.2 or SKU-3.1 by 3pm PST, shipping to zip code 90002 in California, it will arrive on June 11.
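
        As a rough data-shape sketch of that sharing (the types and field names here are invented for illustration, not SFN's actual model):

            type DeliveryPromise = { orderBy: string; deliveryDate: string };

            // One schedule per combination of stocking locations and item characteristics.
            type DeliverySchedule = {
              id: string;
              datesByZip: Record<string, DeliveryPromise>;
            };

            const heavyItemsCaNy: DeliverySchedule = {
              id: "heavy-CA-NY",
              datesByZip: {
                "10019": { orderBy: "4pm EST", deliveryDate: "June 10" },
                "90002": { orderBy: "3pm PST", deliveryDate: "June 11" },
              },
            };

            // SKUs from different shops can point at the same schedule.
            const skuToSchedule: Record<string, string> = {
              "SKU-1.2": heavyItemsCaNy.id, // Shop 1
              "SKU-3.1": heavyItemsCaNy.id, // Shop 3
            };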

        Looking Up Delivery Dates in Real Time

        When Shopify surfaces a delivery date on a product details page or at checkout, it’s a direct, fast lookup (under 100 ms) to find the pre-computed date for that item to the destination address. This is because the SFN app uploads pre-calculated schedules of delivery dates to the Shopify platform and maps each SKU to a schedule using the delivery date APIs.  

        SFN System overview
        System overview

        The date the buyer sees at checkout is sent back to SFN during order fulfillment, where it's used to ensure the order is routed to a warehouse and assigned a shipping label that can meet the delivery date.

        There you have it, a highly simplified overview of how we built two-day delivery in the SFN.


        Learn More About SFN

        Spin Cycle: Shopify's SFN Team Overcomes a Cloud-Development Spiral

        Wherever you are, your next journey starts here! If building systems from the ground up to solve real-world problems interests you, our Engineering blog has stories about other challenges we have encountered. Intrigued? Visit our Engineering career page to find out about our open positions and learn about Digital by Design.

        Continue reading

        Instant Performance Upgrade: From FlatList to FlashList

        Instant Performance Upgrade: From FlatList to FlashList

        When was the last time you scrolled through a list? Odds are it was today, maybe even seconds ago. Iterating over a list of items is a feature common to many frameworks, and React Native’s FlatList is no different. The FlatList component renders items in a scrollable view without you having to worry about memory consumption and layout management (sort of, we’ll explain later).

        The challenge, as many developers can attest to, is getting FlatList to perform on a range of devices without display artifacts like drops in UI frames per second (FPS) and blank items while scrolling fast.

        Our React Native Foundations team solved this challenge by creating FlashList, a fast and performant list component that can be swapped into your project with minimal effort. The requirements for FlashList included:

        • High frame rate: even when using low-end devices, we want to guarantee the user can scroll smoothly, at a consistent 60 FPS or greater.
        • No more blank cells: minimize the display of empty items caused by code not being able to render them fast enough as the user scrolls (and causing them to wonder what’s up with their app).
        • Developer friendly: we wanted FlashList to be a drop-in replacement for FlatList and also include helpful features beyond what the current alternatives offer.

        The FlashList API is five stars. I've been recommending all my friends try FlashList once we open source.

        Daniel Friyia, Shopify Retail team
        Performance comparison between FlatList and FlashList

        Here’s how we approached the FlashList project and its benefits to you.

        The Hit List: Why the World Needs Faster Lists

        Evolving FlatList was the perfect match between our mission to continuously create powerful components shared across Shopify and solving a difficult technical challenge for React Native developers everywhere. As more apps move from native to React Native, how could we deliver performance that met the needs of today’s data-hungry users while keeping lists simple and easy to use for developers?

        Lists are in heavy use across Shopify, in everything from Shop to POS. For Shopify Mobile in particular, where over 90% of the app uses native lists, there was growing demand for a more performant alternative to FlatList as we moved more of our work to React Native. 

        What About RecyclerListView?

        RecyclerListView is a third-party package optimized for rendering large amounts of data with very good real-world performance. The difficulty lies in using it: developers must spend a lot of time understanding how it works and playing with it, and it requires a significant amount of code to manage. For example, RecyclerListView needs predicted size estimates for each item, and if they aren't accurate, the UI performance suffers. It also renders the entire layout in JavaScript, which can lead to visible glitches.

        When done right, RecyclerListView works very well! But it’s just too difficult to use most of the time. It’s even harder if you have items with dynamic heights that can’t be measured in advance.

        So, we decided to build upon RecyclerListView and solve these problems to give the world a new list library that achieved native-like performance and was easy to use.

        Shop app's search page using FlashList

        The Bucket List: Our Approach to Improving React Native Lists

        Our React Native Foundations team takes a structured approach to solving specific technical challenges and creates components for sharing across Shopify apps. Once we’ve identified an ambitious problem, we develop a practical implementation plan that includes rigorous development and testing, and an assessment of whether something should be open sourced or not. 

        Getting list performance right is a popular topic in the React Native community, and we’d heard about performance degradations when porting apps from native to React Native within Shopify. This was the perfect opportunity for us to create something better. We kicked off the FlashList project to build a better list component that had a similar API to FlatList and boosted performance for all developers. We also heard some skepticism about its value, as some developers rarely saw issues with their lists on newer iOS devices.

        The answer here was simple. Our apps are used on a wide range of devices, so developing a performant solution across devices based on a component that was already there made perfect sense.

        “We went with a metrics-based approach for FlashList rather than subjective, perception-based feelings. This meant measuring and improving hard performance numbers like blank cells and FPS to ensure the component worked on low-end devices first, with high-end devices the natural byproduct.” - David Cortés, React Native Foundations team

        FlashList feedback via Twitter

        Improved Performance and Memory Utilization

        Our goal was to beat the performance of FlatList by orders of magnitude, measured by UI thread FPS and JavaScript FPS. With FlatList, developers typically see frame drops, even with simple lists. With FlashList, we improved upon the FlatList approach of generating new views from scratch on every update by using an existing, already allocated view and refreshing elements within it (that is, recycling cells). We also moved some of the layout operations to native, helping smooth over some of the UI glitches seen when RecyclerListView is provided with incorrect item sizes.

        This streamlined approach boosted performance to 60 FPS or greater—even on low-end devices!

        Comparison between FlatList and FlashList via Twitter (note this is on a very low-end device)

        We applied a similar strategy to improve memory utilization. Say you have a Twitter feed with 200 to 300 tweets per page. FlatList starts rendering with a large number of items to ensure they're available as the user scrolls up or down. FlashList, in comparison, requires a much smaller buffer, which reduces memory footprint, improves load times, and keeps the app significantly more responsive. The default buffer is just 250 pixels.

        Both these techniques, along with other optimizations, help FlashList achieve its goal of no more blank cells, as the improved render times can keep up with user scrolling on a broader range of devices. 

        Shopify Mobile is using FlashList as the default and the Shop team moved search to FlashList. Multiple teams have seen major improvements and are confident in moving the rest of their screens.

        Talha Naqvi, Software Development Manager, Retail Accelerate

        These performance improvements included extensive testing and benchmarking on Android devices to ensure we met the needs of a range of capabilities. A developer may not see blank items or choppy lists on the latest iPhone but that doesn’t mean it’ll work the same on a lower-end device. Ensuring FlashList was tested and working correctly on more cost-effective devices made sure that it would work on the higher-end ones too.

        Developer Friendly

        If you know how FlatList works, you know how FlashList works. It’s easy to learn, as FlashList uses almost the same API as FlatList, and has new features designed to eliminate any worries about layout and performance, so you can focus on your app’s value.

        Shotgun's main feature is its feed, so ensuring consistent and high-quality performance has always been crucial. That's why using FlashList was a no brainer. I love that compared to the classic FlatList, you can scrollToIndex to an index that is not within your view. This allowed us to quickly develop our new event calendar list, where users can jump between dates to see when and where there are events.

        It takes seconds to swap your existing list implementation from FlatList to FlashList. All you need to do is change the component name and optionally add the estimatedItemSize prop, as you can see in this example from our documentation:

        Example of FlashList usage.
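
        That documentation snippet isn't reproduced above, but the swap amounts to something like the following sketch (the list contents are made up for illustration):

            import React from "react";
            import { Text } from "react-native";
            import { FlashList } from "@shopify/flash-list";

            const DATA = [{ title: "First item" }, { title: "Second item" }];

            const MyList = () => (
              // Previously a FlatList with the same data and renderItem props.
              <FlashList
                data={DATA}
                renderItem={({ item }) => <Text>{item.title}</Text>}
                estimatedItemSize={50} // optional size hint that helps FlashList lay items out
              />
            );

            export default MyList;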

        Powerful Features

        Going beyond the standard FlatList props, we added new features based on common scenarios and developer feedback. Here are three examples:

        • getItemType: improves render performance for lists that have different types of items, like text vs. image, by leveraging separate recycling pools based on type (see the sketch after this list).
        Example of getItemType usage
        • Flexible column spans: developers can create grid layouts where each item’s column span can be set explicitly or with a size estimate (using overrideItemLayout), providing flexibility in how the list appears.
        • Performance metrics: as we moved many layout functions to native, FlashList can extract key metrics for developers to measure and troubleshoot performance, like reporting on blank items and FPS. This guide provides more details.

        Documentation and Samples

        We provide a number of resources to help you get started with FlashList.

        Flash Forward with FlashList

        Accelerate your React Native list performance by installing FlashList now. It’s already deployed within Shopify Mobile, Shop, and POS and we’re working on known issues and improvements. We recommend starring the FlashList repo and can’t wait to hear your feedback!



        Continue reading

        When is JIT Faster Than A Compiler?

        When is JIT Faster Than A Compiler?

        I had this conversation over and over before I really understood it. It goes:

        “X language can be as fast as a compiled language because it has a JIT compiler!”
        “Wow! It’s as fast as C?”
        “Well, no, we’re working on it. But in theory, it could be even FASTER than C!”
        “Really? How would that work?”
        “A JIT can see what your code does at runtime and optimize for only that. So it can do things C can’t!”
        “Huh. Uh, okay.”

        It gets hand-wavy at the end, doesn’t it? I find that frustrating. These days I work on YJIT, a JIT for Ruby. So I can make this extremely NOT hand-wavy. Let’s talk specifics.

        I like specifics.

        Wait, What’s JIT Again?

        An interpreter reads a human-written description of your app and executes it. You’d usually use interpreters for Ruby, Python, Node.js, SQL, and nearly all high-level dynamic languages. When you run an interpreted app, you download the human-written source code, and you have an interpreter on your computer that runs it. The interpreter effectively sees the app code for the first time when it runs it. So an interpreter doesn’t usually spend much time fixing or improving your code. They just run it how it’s written. An interpreter that significantly transforms your code or generates machine code tends to be called a compiler.

        A compiler typically turns that human-written code into native machine code, like those big native apps you download. The most straightforward compilers are ahead-of-time compilers. They turn human-written source code into a native binary executable, which you can download and run. A good compiler can greatly speed up your code by putting a lot of effort into improving it ahead of time. This is beneficial for users because the app developer runs the compiler for them. The app developer pays the compile cost, and users get a fast app. Sometimes people call anything a compiler if it translates from one kind of code to another—not just source code to native machine code. But when I say “compiler” here, I mean the source-code-to-machine-code kind.

        A JIT, aka a Just-In-Time compiler, is a partial compiler. A JIT waits until you run the program and then translates the most-used parts of your program into fast native machine code. This happens every time you run your program. It doesn’t write the code to a file—okay, except MJIT and a few others. But JIT compilation is primarily a way to speed up an interpreter—you keep the source code on your computer, and the interpreter has a JIT built into it. And then long-running programs go faster.

        It sounds kind of inefficient, doesn’t it? Doing it all ahead of time sounds better to me than doing it every time you run your program.

        But some languages are really hard to compile correctly ahead of time. Ruby is one of them. And even when you can compile ahead of time, often you get bad results. An ahead-of-time compiler has to create native code that will always be correct, no matter what your program does later, and sometimes that means it’s about as bad as an interpreter, which has that exact same requirement.

        Ruby is Unreasonably Dynamic

        Ruby is like my four-year-old daughter: the things I love most about it are what make it difficult.

        In Ruby, I can redefine + on integers like 3, 7, or -45. Not just at the start—if I wanted, I could write a loop and redefine what + means every time through that loop. My new + could do anything I want. Always return an even number? Print a cheerful message? Write “I added two numbers” to a log file? Sure, no problem.

        That’s thrilling and wonderful and awful in roughly equal proportions.

        And it’s not just +. It’s every operator on every type. And equality. And iteration. And hashing. And so much more. Ruby lets you redefine it all.

        The Ruby interpreter needs to stop and check, every time you add two numbers, whether you’ve changed what + means in between. You can even redefine + in a background thread, and Ruby just rolls with it. It picks up the new + and keeps right on going. In a world where everything can be redefined, you can be forgiven for not knowing many things in advance, and the interpreter handles it.

        Ruby lets you do awful, awful things. It lets you do wonderful, wonderful things. Usually, it’s not obvious which is which. You have expressive power that most languages say is a very bad idea.

        I love it.

        Compilers do not love it.

        When JITs Cheat, Users Win

        Okay, we’ve talked about why it’s hard for ahead-of-time (AOT) compilers to deliver performance gains. But then, how do JIT compilers do it? Ruby lets you constantly change absolutely everything. That’s not magically easy at runtime. If you can’t compile + or == or any operator, why can you compile some parts of the program?

        With a JIT, you have a compiler around as the program runs. That allows you to do a trick.

        The trick: you can compile the method wrong and still get away with it.

        Here’s what I mean.

        YJIT asks, “Well, what if you didn’t change what + means every time?” You almost never do that. So it can compile a version of your method where + keeps its meaning from right now. And so do equality, iteration, hashing, and everything else you can change in Ruby but nearly never do.

        But… that’s wrong. What if I do change those things? Sometimes apps do. I’m looking at you, ActiveRecord.

        But your JIT has a compiler around at runtime. So when you change what + means, it will throw away all those methods it compiled with the old definition. Poof. Gone. If you call them again, you get the interpreted version again. For a while—until JIT compiles a new version with the new definition. This is called de-optimization. When the code starts being wrong, throw it away. When 3+4 stops being 7 (hey, this is Ruby!), get rid of the code that assumed it was. The devil is in the details—switching from one version of a method to another version midway through is not easy. But it’s possible, and JIT compilers basically do it successfully.

        So your JIT can assume you don’t change + every time through the loop. Compilers and interpreters can’t get away with that.

        An AOT compiler has to create fully correct code before your app even ships. It’s very limited if you change anything. And even if it had some kind of fallback (“Okay, I see three things 3+4 could be in this app”), it can only respond at runtime with something it figured out ahead of time. Usually, that means very conservative code that constantly checks if you changed anything.

        An interpreter must be fully correct and respond immediately if you change anything. So interpreters normally assume that you could have redefined everything at any time. The normal Ruby interpreter spends a lot of time checking whether you’ve changed the definition of +. You can do clever things to speed up that check, and CRuby does. But if you make your interpreter extremely clever, pre-building optimized code and invalidating assumptions, eventually you realize that you’ve built a JIT.

        Ruby and YJIT

        I work on YJIT, which is part of CRuby. We do the stuff I mention here. It’s pretty fast.

        There are a lot of fun specifics to figure out. What do we track? How do we make it faster? When it’s invalid, do we need to recompile or cancel it? Here’s an example I wrote recently.

        You can try out our work by turning on --yjit on recent Ruby versions. You can use even more of our work if you build the latest head-of-master Ruby, perhaps with ruby-build 3.2.0-dev. You can also get all the details by reading the source, which is built right into CRuby.

        By the way, YJIT has some known bugs in 3.1 that mean you should NOT use it for real production. We’re a lot closer now—it should be production-ready for some uses in 3.2, which comes out Christmas 2022.

        What Was All That Again?

        A JIT can add assumptions to your code, like the fact that you probably didn’t change what + means. Those assumptions make the compiled code faster. If you do change what + means, you can throw away the now-incorrect code.

        An ahead-of-time compiler can’t do that. It has to assume you could change anything you want. And you can.

        An interpreter can’t do that. It has to assume you could have changed anything at any time. So it re-checks constantly. A sufficiently smart interpreter that pre-builds machine code for current assumptions and invalidates if it changes could be as fast as JIT… Because it would be a JIT.

        And if you like blog posts about compiler internals—who doesn’t?—you should hit “Yes, sign me up” up above and to the left.

        Noah Gibbs wrote the ebook Rebuilding Rails and then a lot about how fast Ruby is at various tasks. Despite being a grumpy old programmer in Inverness, Scotland, Noah believes that some day, somehow, there will be a second game as good as Stuart Smith’s Adventure Construction Set for the Apple IIe. Follow Noah on Twitter and GitHub



        Continue reading

        Spin Cycle: Shopify’s SFN Team Overcomes a Cloud-Development Spiral

        Spin Cycle: Shopify’s SFN Team Overcomes a Cloud-Development Spiral

        You may have read about Spin, Shopify’s new cloud-based development tool. Instead of editing and running a local version of a service on a developer’s MacBook, Shopify is moving towards a world where the development servers are available on-demand as a container running in Kubernetes. When using Spin, you don’t need anything on your local machine other than an ssh client and VSCode, if that’s your editor of choice. 

        By moving development off our MacBooks and onto Spin, we unlock the ability to easily share work in progress with coworkers and can work on changes that span different codebases without any friction. And because Spin instances are lightweight and ephemeral, we don’t run the risk of messing up long-lived development databases when experimenting with data migrations.

        Across Shopify, teams have been preparing and adjusting their codebases so that their services can run smoothly in this kind of environment. In the Shopify Fulfillment Network (SFN) engineering org, we put together a team of three engineers to get us up and running on Spin.

        At first, it seemed like the job would be relatively straightforward. But as we started doing the work, we began to notice some less obvious forces at play that were pushing against our efforts.

        Since it was easier for most developers to use our old tooling instead of Spin while we were getting the kinks worked out, developers would often unknowingly commit changes that broke some functionality we’d just enabled for Spin. In hindsight, the process of getting SFN working on Spin is a great example of the kind of hidden difficulty in technical work that's more related to human systems than how to get bits of electricity to do what you want.

        Before we get to the interesting stuff, it’s important to understand the basics of the technical challenge. We'll start by getting a broad sense of the SFN codebase and then go into the predictable work that was needed to get it running smoothly in Spin. With that foundation, we’ll be able to describe how and why we started treading water, and ultimately how we’re pushing past that.

        The Shape of SFN

        SFN exists to take care of order fulfillment on behalf of Shopify merchants. After a customer has completed the checkout process, their order information is sent to SFN. We then determine which warehouse has enough inventory and is best positioned to handle the order. Once SFN has identified the right warehouse, it sends the order information to the service responsible for managing that warehouse’s operations. The state of the system is visible to the merchant through the SFN app running in the merchant’s Shopify admin. The SFN app communicates to Shopify Core via the same GraphQL queries and mutations that Shopify makes available to all app developers.

        At a highly simplified level, this is the general shape of the SFN codebase:

        Diagram of SFN codebase. SFN is a large rectangle in the center containing six boxes labelled Subcomponent. Outside of the SFN box are six directional arrows. Each arrow connects to boxes called Dependency
        SFN’s monolithic Rails application with many external dependencies

        Similar to the Shopify codebase, SFN is a monolithic Rails application divided into individual components owned by particular teams. Unlike Shopify Core, however, SFN has many strong dependencies on services outside of its own monolith.

        SFN’s biggest dependency is on Shopify itself, but there are plenty more. For example, SFN does not design shipping labels, but does need to send shipping labels to the warehouse. So, SFN is a client to a service that provides valid shipping labels. Similarly, SFN does not tell the mobile Chuck robots in a warehouse where to go—we are a client of a service that handles warehouse operations.

        The value that SFN provides is in gluing together a bunch of separate systems with some key logic living in that glue. There isn't much you can do with SFN without those dependencies around in some form.

        How SFN Handles Dependencies

        As software developers, we need quick feedback about whether an in-progress change is working as expected. And to know if something is working in SFN, that code generally needs to be validated alongside one or several of SFN’s dependencies. For example, if a developer is implementing a feature to display some text in the SFN app after a customer has placed an order, there’s no useful way to validate that change without also having Shopify available.

        So the work of getting a useful development environment for SFN with Spin appears to be about looking at each dependency, figuring out how to handle that dependency, and then implementing that decision. We have a few options for how to handle any particular dependency when running SFN in Spin:

        1. Run an instance of the dependency directly in the same Spin container.
        2. Mock the dependency.
        3. Use a shared running instance of the dependency, such as a staging or live test environment.

        Given all the dependencies that SFN has, this seems like a decent amount of work for a three-person team.

        But this is not the full extent of the problem—it’s just the foundation.

        Once we added configuration to make some dependency or some functional flow of SFN work in Spin, another commit would often be added to SFN that nullifies that effort. For example, after getting some particular flow functioning in a Spin environment, the implementation of that flow might be rewritten with new dependencies that are not yet configured to work in Spin.

        One apparent solution to this problem would be simply to pay more attention to what work is in flight in the SFN codebase and better prepare for upcoming changes.

        But here’s the problem: It’s not just one or two flows changing. Across SFN, the internal implementation of functionality is constantly being improved and refactored. With over 150 SFN engineers deploying to production over 30 times a day, the SFN codebase doesn’t sit still for long. On top of that, Spin itself is constantly changing. And all of SFN’s dependencies are changing. For any dependencies that were mocked, those mocks will become stale and need to be updated.

        The more we accomplished, the more functionality existed with the potential to stop working when something changes. And when one of those regressions occurred, we needed to interrupt the dependency we were working on solving in order to keep a previously solved flow functioning. The tension between making improvements and maintaining what you’ve already built is central to much of software engineering. Getting SFN working on Spin was just a particularly good example.

        The Human Forces on the System

        After recognizing the problem, we needed to step back and look at the forces acting on the system. What incentive structures and feedback loops are contributing to the situation?

        In the case of getting SFN working on Spin, changes were happening frequently and those changes were causing regressions. Some of those changes were within our control (e.g., a change goes into SFN that isn’t configured to work in Spin), and some are less so (e.g., Spin itself changing how it uses certain inputs).

        This led us to observe two powerful feedback loops that could be happening when SFN developers are working in Spin:

        Two feedback loops: the Loop of Happy Equilibrium and the Spiral of Struggle

        If it’s painful to use Spin for SFN development, it’s less likely that developers will use Spin the next time they have to validate their work. And if a change hasn’t been developed and tested using Spin, maybe something about that change breaks a particular testing flow, and that causes another SFN developer to become frustrated enough to stop using Spin. And this cycle continues until SFN is no longer usable in Spin.

        Alternatively, if it’s a great experience to use and validate work in Spin, developers will likely want to continue using the tool, which will catch any Spin-specific regressions before they make it into the main branch.

        As you can imagine, it’s very difficult to move from the Spiral of Struggle into the positive Loop of Happy Equilibrium. Our solution is to try our best to dampen the force acting on the negative spiral while simultaneously propping up the force of the positive feedback loop. 

        As the team focused on getting SFN working on Spin, our solution was to be very intentional about where we spent our efforts while asking the other SFN developers to endure a little pain and pitch in as we went through this transition. The SFN-on-Spin team narrowed its focus to getting SFN to a basic level of functionality on Spin so that most developers could use it for the most common validation flows, and we prioritized fixing any bugs that disrupted those areas. This meant explicitly not working to get all SFN functionality running on Spin, but just enough that we could manage the upkeep. At the same time, we asked other SFN developers to use Spin for their daily work, even when it was missing functionality they needed or wanted. Where they felt frustrations or saw gaps, we encouraged and supported them in adding the functionality they needed.

        Breaking the Cycle

        Our hypothesis is that this is a temporary stage of transition to cloud development. If we’re successful, we’ll land in the Loop of Happy Equilibrium where regressions are caught before they’re merged, individuals add the missing functionality they need, and everyone ultimately has a fun time developing. They will feel confident about shipping their code.

        Our job seems to be all about code and making computers do what we say. But many of the real-life challenges we face when working on a codebase are not apparent from code or architecture diagrams. Instead they require us to reflect on the forces operating on the humans that are building that software. And once we have an idea of what those forces might be, we can brainstorm how to disrupt or encourage the feedback loops we’ve observed.

        Jen is a Staff Software Engineer at Shopify who's spent her career seeking out and building teams that challenge the status quo. In her free time, she loves getting outdoors and spending time with her chocolate lab.


        Learn More About Spin

        The Journey to Cloud Development: How Shopify Went All-In On Spin
        The Story Behind Shopify's Isospin Tooling
        Spin Infrastructure Adventures: Containers, Systemd, and CGroups


        Continue reading

        Mastering React’s Stable Values

        Mastering React’s Stable Values

        The concept of a stable value is a distinctly React term, and it’s especially relevant since the introduction of Functional Components. It refers to values (usually coming from a hook) that have the same value across multiple renders. And they’re immediately confusing. In this post, Colin Gray, Principal Developer at Shopify, walks through some cases where they really matter and how to make sense of them.

        Continue reading

        10 Tips for Building Resilient Payment Systems

        10 Tips for Building Resilient Payment Systems

        During the past five years I’ve worked on a lot of different parts of Shopify’s payment infrastructure and helped onboard dozens of developers in one way or another. Some people came from different programming languages, others used Ruby before but were new to the payments space. What was mostly consistent among new team members was little experience in building systems at Shopify’s scale—it was new for me too when I joined.

        It’s hard to learn something when you don’t know what you don’t know. As I learned things over the years—sometimes the hard way—I eventually found myself passing on these lessons to others. I distilled these topics into a presentation I gave to my team and boiled that down into this blog post. So, without further ado, here are my top 10 tips and tricks for building resilient payment systems.

        1. Lower Your Timeouts

        Ruby’s built-in Net::HTTP client has a default timeout of 60 seconds to open a connection to a server, 60 seconds to write data, and another 60 seconds to read a response. For online applications where a human being is waiting for something to happen, this is far too long, but at least there’s a default timeout in place. HTTP clients in other programming languages, like http.Client in Go and http.request in Node.js, don’t have a default timeout at all! This means an unresponsive server could tie up your resources indefinitely and increase your infrastructure bill unnecessarily.

        Timeouts can also be set in data stores. For example MySQL has the MAX_EXECUTION_TIME optimizer hint for setting a per-SELECT query timeout in milliseconds. Combined with other tools like pt-kill, we try to prevent bad queries from overwhelming the database.

        If there’s only a single thing you take away from this post, dear reader, it should be to investigate and set low timeouts everywhere you can. “But what is the right timeout to set?” you may wonder. That ultimately depends on your application’s unique situation and can be deduced with monitoring (more on that later), but I found that an open timeout of one second with a write, read, or query timeout of five seconds is a decent starting point. Consider this waiting time from the perspective of the end user: would you like to wait more than five seconds for a page to load successfully or show an error?
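
        The same principle applies whatever client you use. As a minimal sketch in Python’s requests library, which also ships without a default timeout (the endpoint here is hypothetical), explicit timeouts look like this:

        import requests

        # requests, like Go's http.Client, has no default timeout, so without the
        # timeout argument this call could hang indefinitely.
        response = requests.get(
            "https://payments.example.com/charges",  # hypothetical endpoint
            timeout=(1.0, 5.0),  # (connect timeout, read timeout) in seconds
        )
        response.raise_for_status()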

        2. Install Circuit Breakers

        Timeouts put an upper bound on how long we wait before giving up. But services that go down tend to stay down for a while, so if we see multiple timeouts in a short period of time, we can improve on this by not trying at all. Much like the circuit breaker you will find in your house or apartment, once the circuit is opened or tripped, nothing is let through.

        Shopify developed Semian to protect Net::HTTP, MySQL, Redis, and gRPC services with a circuit breaker in Ruby. By raising an exception instantly once we detect a service being down, we save resources by not waiting for another timeout we expect to happen. In some cases rescuing these exceptions allows you to provide a fallback. Building and Testing Resilient Ruby on Rails Applications describes how we design and unit tests such fallbacks using Toxiproxy.

        The Semian readme recommends concatenating the host and port of an HTTP endpoint to create the identifier for the resource being protected. Worldwide payment processing typically uses a single endpoint, but payment gateways often use local acquirers behind the scenes to optimize authorization rates and lower costs. For Shopify Payments credit card transactions, we add the merchant's country code to the endpoint host and port to create a more fine-grained Semian identifier, so an open circuit due to a local outage in one country doesn’t affect transactions for merchants in other countries.

        Semian and other circuit breaker implementations aren’t a silver bullet that will solve all your resiliency problems just by being added to your application. They require understanding the ways your application can fail and what falling back could look like. At scale, a circuit breaker can still waste a lot of resources (and money) as well. The article Your Circuit Breaker is Misconfigured explains how to fine-tune this pattern for maximum performance.
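
        To make the pattern concrete, here’s a toy circuit breaker in Python. It’s only a sketch of the idea, not Semian’s API or implementation, and the thresholds are arbitrary:

        import time

        class CircuitOpenError(Exception):
            pass

        class CircuitBreaker:
            """Toy circuit breaker: opens after a run of failures, then fails fast
            until a cool-down period has passed."""

            def __init__(self, error_threshold=3, error_timeout=10.0):
                self.error_threshold = error_threshold
                self.error_timeout = error_timeout
                self.failures = 0
                self.opened_at = None

            def call(self, func, *args, **kwargs):
                if self.opened_at is not None:
                    if time.monotonic() - self.opened_at < self.error_timeout:
                        raise CircuitOpenError("circuit open, failing fast")
                    self.opened_at = None  # half-open: let one request through
                try:
                    result = func(*args, **kwargs)
                except Exception:
                    self.failures += 1
                    if self.failures >= self.error_threshold:
                        self.opened_at = time.monotonic()
                    raise
                self.failures = 0
                return result

        Wrapping a call such as breaker.call(fetch_charge, charge_id) then either returns the result, raises the underlying error, or fails fast with CircuitOpenError while the circuit is open.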

        3. Understand Capacity

        Understanding a bit of queueing theory goes a long way toward being able to reason about how a system will behave under load. Slightly summarized, Little’s Law states that “the average number of customers in a system (over an interval) is equal to their average arrival rate, multiplied by their average time in the system.” The arrival rate is the rate at which customers enter and leave the system.

        Some might not realize it at first, but queues are everywhere: in grocery stores, traffic, factories, and as I recently rediscovered, at a festival in front of the toilets. Jokes aside, you find queues in online applications as well. A background job, a Kafka event, and a web request all are examples of units of work processed on queues. Put in a formula, Little’s Law is expressed as capacity = throughput * latency. This also means that throughput = capacity / latency. Or in more practical terms: if we have 50 requests arrive in our queue and it takes an average of 100 milliseconds to process a request, our throughput is 500 requests per second.
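
        Written out as a quick calculation using the numbers from that example:

        # Little's Law: capacity = throughput * latency, so throughput = capacity / latency.
        requests_in_system = 50    # requests queued or in flight
        average_latency_s = 0.100  # 100 milliseconds per request

        throughput = requests_in_system / average_latency_s
        print(throughput)  # 500.0 requests per second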

        With the relationship between queue size, throughput, and latency clarified, we can reason about what changing any of the variables implies. An N+1 query increases the latency of a request and lowers our throughput. If the number of requests coming in exceeds our capacity, the request queue grows, and at some point a client is waiting so long for their request to be served that it times out. At some point you need to put a limit on the amount of work coming in—your application can’t out-scale the world. Rate limiting and load shedding are two techniques for this.

        4. Add Monitoring and Alerting

        With our newfound understanding of queues, we now have a better idea of what kind of metrics we need to monitor to know our system is at risk of going down due to overload. Google’s site reliability engineering (SRE) book lists four golden signals a user-facing system should be monitored for:

        • Latency: the amount of time it takes to process a unit of work, broken down between success and failures. With circuit breakers failures can happen very fast and lead to misleading graphs.
        • Traffic: the rate in which new work comes into the system, typically expressed in requests per minute.
        • Errors: the rate of unexpected things happening. In payments, we distinguish between payment failures and errors. An example of a failure is a charge being declined due to insufficient funds, which isn’t unexpected at all; HTTP 500 response codes from our financial partners, on the other hand, are. However, a sudden increase in failures might need further investigation.
        • Saturation: how much load the system is under, relative to its total capacity. This could be the amount of memory used versus available or a thread pool’s active threads versus total number of threads available, in any layer of the system.
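
        As a rough sketch of what instrumenting the first three signals can look like in application code, here’s a small example using the Prometheus Python client. The metric names, port, and payment handler are made up, and saturation metrics usually come from infrastructure exporters rather than the application itself:

        from prometheus_client import Counter, Histogram, start_http_server

        REQUEST_LATENCY = Histogram("payment_request_latency_seconds", "Time spent per payment request")
        REQUEST_OUTCOMES = Counter("payment_requests_total", "Payment requests by outcome", ["outcome"])

        def process(request):
            return {"status": "captured"}  # stand-in for the real payment call

        def handle_payment(request):
            with REQUEST_LATENCY.time():                              # latency
                try:
                    result = process(request)
                    REQUEST_OUTCOMES.labels(outcome="success").inc()  # traffic
                    return result
                except Exception:
                    REQUEST_OUTCOMES.labels(outcome="error").inc()    # errors
                    raise

        start_http_server(9100)  # exposes /metrics for scraping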

        5. Implement Structured Logging

        Where metrics provide a high-level overview of how our system is behaving, logging allows us to understand what happened inside a single web request or background job. Out of the box, Ruby on Rails logs are human-friendly but hard to parse for machines. This can work okay if you have only a single application server, but beyond that you’ll quickly want to store logs in a centralized place and make them easily searchable. Using structured logging in a machine-readable format, like key=value pairs or JSON, allows log aggregation systems to parse and index the data.

        In distributed systems, it’s useful to pass along some sort of correlation identifier. A hypothetical example is when a buyer initiates a payment at checkout, a correlation_id is generated by our Rails controller. This identifier is passed along to a background job that makes the API call to the payment service that handles sensitive credit card data, which contains the correlation identifier in the API parameters and SQL query comments. Because these components of our checkout process all log the correlation_id, we can easily find all related logs when we need to debug this payment attempt.
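
        A minimal sketch of what this can look like, assuming JSON-formatted log lines and illustrative event and field names (our actual pipeline is built on Rails and our logging infrastructure, not this snippet):

        import json
        import logging
        import time
        import uuid

        logging.basicConfig(level=logging.INFO, format="%(message)s")
        logger = logging.getLogger("checkout")

        def log_event(event, **fields):
            # One JSON object per line so a log aggregator can index every field.
            logger.info(json.dumps({"event": event, "timestamp": time.time(), **fields}))

        # Generated once when the buyer starts the payment attempt, then passed along.
        correlation_id = str(uuid.uuid4())

        log_event("payment_attempt_started", correlation_id=correlation_id, amount_cents=2599)
        log_event("payment_api_request", correlation_id=correlation_id, gateway="example-gateway")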

        6. Use Idempotency Keys

        Distributed systems use unreliable networks, even if the networks look reliable most of the time. At Shopify’s scale, a once in a million chance of something unreliable occurring during payment processing means it’s happening many times a day. If this is a payments API call that timed out, we want to retry the request, but do so safely. Double charging a customer's card isn’t just annoying for the card holder, it also opens up the merchant for a potential chargeback if they don’t notice the double charge and refund it. A double refund isn’t good for the merchant's business either.

        In short, we want a payment or refund to happen exactly once despite the occasional hiccups that could lead to sending an API request more than once. Our centralized payment service can track attempts, each of which consists of one or more (retried) identical API requests, by sending an idempotency key that’s unique for each one. The idempotency key looks up the steps the attempt completed (such as creating a local database record of the transaction) and makes sure we send only a single request to our financial partners. If any of these steps fail and a retried request with the same idempotency key is received, recovery steps are run to recreate the same state before continuing. Building Resilient GraphQL APIs Using Idempotency describes how our idempotency mechanism works in more detail.

        An idempotency key needs to be unique for the time we want the request to be retryable, typically 24 hours or less. We prefer using a Universally Unique Lexicographically Sortable Identifier (ULID) for these idempotency keys instead of a random version 4 UUID. ULIDs contain a 48-bit timestamp followed by 80 bits of random data. The timestamp allows ULIDs to be sorted, which works much better with the b-tree data structure databases use for indexing. In one high-throughput system at Shopify we’ve seen a 50 percent decrease in INSERT statement duration by switching from UUIDv4 to ULID for idempotency keys.
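
        A small sketch of why the ULID layout helps, looking only at the raw bytes (real ULIDs are additionally encoded as 26-character Crockford Base32 strings, and in practice you’d use a ULID library rather than rolling your own):

        import os
        import time
        import uuid

        def ulid_like_key() -> bytes:
            # ULID layout: a 48-bit millisecond timestamp followed by 80 random bits.
            timestamp_ms = int(time.time() * 1000)
            return timestamp_ms.to_bytes(6, "big") + os.urandom(10)

        # Keys generated around the same time share a common prefix, so they insert
        # near each other in a b-tree index; uuid4 values scatter across the whole index.
        print(ulid_like_key().hex())
        print(uuid.uuid4())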

        7. Be Consistent With Reconciliation

        With reconciliation we make sure that our records are consistent with those of our financial partners. We reconcile individual records, such as charges or refunds, and aggregates, such as the current balance not yet paid out to a merchant. Having accurate records isn’t just for display purposes; they’re also used as input for tax forms we’re required to generate for merchants in some jurisdictions.

        In case of a mismatch, we record the anomaly in our database. An example is the MismatchCaptureStatusAnomaly, expressing that the status of a captured local charge wasn’t the same as the status returned by our financial partners. Often we can automatically attempt to remediate the discrepancy and mark the anomaly as resolved. In cases where this isn’t possible, the developer team investigates anomalies and ships fixes as necessary.

        Even though we attempt automatic fixes where possible, we want to keep track of the mismatch so we know what our system did and how often. We should rely on anomalies to fix things as a last resort, preferring solutions that prevent anomalies from being created in the first place.

        8. Incorporate Load Testing

        While Little’s Law is a useful theorem, practice is messier: the processing time for work isn’t uniformly distributed, making it impossible to achieve 100% saturation. In practice, queue size starts growing somewhere around the 70 to 80 percent mark, and if the time spent waiting in the queue exceeds the client timeout, from the client’s perspective our service is down. If the volume of incoming work is large enough, our servers can even run out of memory to store work on the queue and crash.

        There are various ways we can keep queue size under control. For example, we use scriptable load balancers to throttle the amount of checkouts happening at any given time. In order to provide a good user experience for buyers, if the amount of buyers wanting to check out exceeds our capacity, we place these buyers on a waiting queue (I told you they are everywhere!) before allowing them to pay for their order. Surviving Flashes of High-Write Traffic Using Scriptable Load Balancers describes this system in more detail.

        We regularly test the limits and protection mechanisms of our systems by simulating large volume flash sales on specifically set up benchmark stores. Pummelling the Platform–Performance Testing Shopify describes our load testing tooling and philosophy. Specifically for load testing payments end-to-end, we have a bit of a problem: the test and staging environments of our financial partners don’t have the same capacity or latency distribution as production. To solve this, our benchmark stores are configured with a special benchmark gateway whose responses mimic these properties.

        9. Get on Top of Incident Management

        As mentioned at the start of this article, we know that failure can’t be completely avoided and is a situation we need to prepare for. An incident usually starts when the on-call service owners get paged, either by an automatic alert based on monitoring or by hand if someone notices a problem. Once the problem is confirmed, we start the incident process with a command sent to our Slack bot, spy.

        The conversation moves to the assigned incident channel where we have three roles involved:

        • Incident Manager on Call (IMOC): responsible for coordinating the incident
        • Support Response Manager (SRM): responsible for public communication
        • Service owner(s): responsible for restoring stability

        The article Implementing ChatOps into our Incident Management Procedure goes into more detail about the process. Once the problem has been mitigated, the incident is stopped, and the Slack bot generates a Service Disruption in our services database application. The disruption contains an initial timeline of events, Slack messages marked as important, and a list of people involved.

        10. Organize Incident Retrospectives

        We aim to have an incident retrospective meeting within a week after the incident occurred. During this meeting, we dig deep into:

        • what exactly happened
        • what incorrect assumptions we held about our systems
        • what we can do to prevent the same thing from happening again

        Once these things are understood, typically a few action items are assigned to implement safeguards to prevent the same thing from happening again.

        Retrospectives aren’t just good for trying to prevent problems, they’re also a valuable learning tool for new members of the team. At Shopify, the details of every incident are internally public for all employees to learn from. A well-documented incident can also be a training tool for newer team members joining the on-call rotation, either as an archived document to refer to or by creating a disaster role-playing scenario from it.

        Scratching the Surface

        I moved from my native Netherlands to Canada for this job in 2016, before Shopify became a Digital by Design company. During my work, I’m often reminded of this Dutch saying “trust arrives on foot, but leaves on horseback.” Merchants’ livelihoods are dependent on us if they pick Shopify Payments for accepting payments online or in-person, and we take that responsibility seriously. While failure isn’t completely avoidable, there are many concepts and techniques that we apply to minimize downtime, limit the scope of impact, and build applications that are resilient to failure.

        This top ten only scratches the surface; it was meant as an introduction to the kind of challenges the Shopify Payments team deals with, after all. I usually recommend Release It! by Michael Nygard as a good resource for team members who want to learn more.

        Bart is a staff developer on the Shopify Payments team and has been working on the scalability, reliability, and security of Shopify’s payment processing infrastructure since 2016.



        Continue reading

        Data-Centric Machine Learning: Building Shopify Inbox’s Message Classification Model

        Data-Centric Machine Learning: Building Shopify Inbox’s Message Classification Model

        By Eric Fung and Diego Castañeda

        Shopify Inbox is a single business chat app that manages all Shopify merchants’ customer communications in one place, and turns chats into conversions. As we were building the product it was essential for us to understand how our merchants’ customers were using chat applications. Were they reaching out looking for product recommendations? Wondering if an item would ship to their destination? Or were they just saying hello? With this information we could help merchants prioritize responses that would convert into sales and guide our product team on what functionality to build next. However, with millions of unique messages exchanged in Shopify Inbox per month, this was going to be a challenging natural language processing (NLP) task. 

        Our team didn’t need to start from scratch, though: off-the-shelf NLP models are widely available to everyone. With this in mind, we decided to apply a newly popular machine learning process—the data-centric approach. We wanted to focus on fine-tuning these pre-trained models on our own data to yield the highest model accuracy, and deliver the best experience for our merchants.

        A merchant’s Shopify Inbox screen titled Customers that displays snippets of messages from customers that are labelled with things for easy identification like product details, checkout, and edit order.
        Message Classification in Shopify Inbox

        We’ll share our journey of building a message classification model for Shopify Inbox by applying the data-centric approach. From defining our classification taxonomy to carefully training our annotators on labeling, we dive into how a data-centric approach, coupled with a state-of-the-art pre-trained model, led to a very accurate prediction service we’re now running in production.

        Why a Data-Centric Approach?

        A traditional development model for machine learning begins with obtaining training data, then successively trying different model architectures to overcome any poor data points. This model-centric process is typically followed by researchers looking to advance the state-of-the-art, or by those who don't have the resources to clean a crowd-sourced dataset.

        By contrast, a data-centric approach focuses on iteratively making the training data better to reduce inconsistencies, thereby yielding better results for a range of models. Since anyone can download a well-performing, pre-trained model, getting a quality dataset is the key differentiator in being able to produce a high-quality system. At Shopify, we believe that better training data yields machine learning models that can serve our merchants better. If you’re interested in hearing more about the benefits of the data-centric approach, check out Andrew Ng’s talk on MLOps: From Model-centric to Data-centric.

        Our First Prototype

        Our first step was to build an internal prototype that we could ship quickly. Why? We wanted to build something that would enable us to understand what buyers were saying. It didn’t have to be perfect or complex, it just had to prove that we could deliver something with impact. We could iterate afterwards. 

        For our first prototype, we didn't want to spend a lot of time on the exploration, so we had to construct both the model and training data with limited resources. Our team chose a pre-trained model available on TensorFlow Hub called Universal Sentence Encoder. This model can output embeddings for whole sentences while taking into account the order of words. This is crucial for understanding meaning. For example, the two messages below use the same set of words, but they have very different sentiments:

        • “Love! More please. Don’t stop baking these cookies.”
        • “Please stop baking more cookies! Don’t love these.”

        To rapidly build our training dataset, we sought to identify groups of messages with similar meaning, using various dimensionality reduction and clustering techniques, including UMAP and HDBSCAN. After manually assigning topics to around 20 message clusters, we applied a semi-supervised technique, which takes a small amount of labeled data combined with a larger amount of unlabeled data. We hand-labeled a few representative seed messages from each topic, and used them to find additional examples that were similar. For instance, given a seed message of “Can you help me order?”, we used the embeddings to help us find similar messages such as “How to order?” and “How can I get my orders?”. We then sampled from these to iteratively build the training data.

        A visualization of message clusters during one of our explorations
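
        A rough sketch of that embed, reduce, and cluster step, assuming the TensorFlow Hub encoder plus the umap-learn and hdbscan libraries (the parameters are illustrative, not the ones we tuned):

        import hdbscan
        import tensorflow_hub as hub
        import umap

        def cluster_messages(messages):
            """Group raw chat messages (a list of strings) into candidate topics."""
            # Embed each message with the Universal Sentence Encoder (512-dim vectors).
            encoder = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
            embeddings = encoder(messages).numpy()

            # Reduce dimensionality before clustering to make density estimates tractable.
            reduced = umap.UMAP(n_neighbors=15, n_components=5, metric="cosine").fit_transform(embeddings)

            # Density-based clustering groups similar messages; outliers get the label -1.
            return hdbscan.HDBSCAN(min_cluster_size=25).fit_predict(reduced)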

        We used this dataset to train a simple predictive model containing an embedding layer, followed by two fully connected, dense layers. Our last layer contained the logits array for the number of classes to predict on.
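
        A minimal sketch of such a model in Keras, with illustrative layer sizes rather than our exact architecture:

        import tensorflow as tf
        import tensorflow_hub as hub

        NUM_TOPICS = 20  # roughly the number of clusters in the prototype; illustrative

        model = tf.keras.Sequential([
            # Pre-trained encoder: turns a raw message string into a 512-dim embedding.
            hub.KerasLayer(
                "https://tfhub.dev/google/universal-sentence-encoder/4",
                input_shape=[], dtype=tf.string, trainable=False,
            ),
            tf.keras.layers.Dense(256, activation="relu"),
            tf.keras.layers.Dense(64, activation="relu"),
            tf.keras.layers.Dense(NUM_TOPICS),  # logits, one per topic
        ])

        model.compile(
            optimizer="adam",
            loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
            metrics=["accuracy"],
        )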

        This model gave us some interesting insights. For example, we observed that a lot of chat messages are about the status of an order. This helped inform our decision to build an order status request as part of Shopify Inbox’s Instant Answers FAQ feature. However, our internal prototype had a lot of room for improvement. Overall, our model achieved a 70 percent accuracy rate and could only classify 35 percent of all messages with high confidence (what we call coverage). While our scrappy approach of using embeddings to label messages was fast, the labels weren’t always the ground truth for each message. Clearly, we had some work to do.

        We know that our merchants have busy lives and want to respond quickly to buyer messages, so we needed to increase the accuracy, coverage, and speed for version 2.0. Wanting to follow a data-centric approach, we focused on how we could improve our data to improve our performance. We made the decision to put additional effort into defining the training data by re-visiting the message labels, while also getting help to manually annotate more messages. We sought to do all of this in a more systematic way.

        Creating A New Taxonomy

        First, we dug deeper into the topics and message clusters used to train our prototype. We found several broad topics containing hundreds of examples that conflated distinct semantic meanings. For example, messages asking about shipping availability to various destinations (pre-purchase) were grouped in the same topic as those asking about what the status of an order was (post-purchase).

        Other topics had very few examples, while a large number of messages didn’t belong to any specific topic at all. It’s no wonder that a model trained on such a highly unbalanced dataset wasn’t able to achieve high accuracy or coverage.

        We needed a new labeling system that would be accurate and useful for our merchants. It also had to be unambiguous and easy to understand by annotators, so that labels would be applied consistently. A win-win for everybody!

        This got us thinking: who could help us with the taxonomy definition and the annotation process? Fortunately, we have a talented group of colleagues. We worked with our staff content designer and product researcher who have domain expertise in Shopify Inbox. We were also able to secure part-time help from a group of support advisors who deeply understand Shopify and our merchants (and by extension, their buyers).

        Over a period of two months, we got to work sifting through hundreds of messages and came up with a new taxonomy. We listed each new topic in a spreadsheet, along with a detailed description, cross-references, disambiguations, and sample messages. This document would serve as the source of truth for everyone in the project (data scientists, software engineers, and annotators).

        In parallel with the taxonomy work, we also looked at the latest pre-trained NLP models, with the aim of fine-tuning one of them for our needs. The Transformer family is one of the most popular, and we were already using that architecture in our product categorization model. We settled on DistilBERT, a model that promised a good balance between performance, resource usage, and accuracy. Some prototyping on a small dataset built from our nascent taxonomy was very promising: the model was already performing better than version 1.0, so we decided to double down on obtaining a high-quality, labeled dataset.
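
        Loading DistilBERT for this kind of classification with the Hugging Face transformers library looks roughly like the sketch below; the checkpoint name and label count are assumptions for illustration:

        from transformers import AutoModelForSequenceClassification, AutoTokenizer

        checkpoint = "distilbert-base-uncased"  # assumed English base checkpoint
        tokenizer = AutoTokenizer.from_pretrained(checkpoint)
        model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=45)

        # Tokenize a small batch of messages; fine-tuning then proceeds with the usual
        # Trainer or Keras training loop over the labeled dataset.
        batch = tokenizer(
            ["Can you help me order?", "Where is my package?"],
            padding=True, truncation=True, return_tensors="pt",
        )
        logits = model(**batch).logits  # shape: (batch_size, num_labels)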

        Our final taxonomy contained more than 40 topics, grouped under five categories: 

        • Products
        • Pre-Purchase
        • Post-Purchase
        • Store
        • Miscellaneous

        We arrived at this hierarchy by thinking about how an annotator might approach classifying a message, viewed through the lens of a buyer. The first thing to determine is: where was the buyer on their purchase journey when the message was sent? Were they asking about a detail of the product, like its color or size? Or, was the buyer inquiring about payment methods? Or, maybe the product was broken, and they wanted a refund? Identifying the category helped to narrow down our topic list during the annotation process.

        Our in-house annotation tool displaying the message to classify, along with some of the possible topics, grouped by category

        Each category contains an other topic to group the messages that don’t have enough content to be clearly associated with a specific topic. We decided to not train the model with the examples classified as other because, by definition, they were messages we couldn’t classify ourselves in the proposed taxonomy. In production, these messages get classified by the model with low probabilities. By setting a probability threshold on every topic in the taxonomy, we could decide later on whether to ignore them or not.
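
        A simplified sketch of that thresholding logic (the topic names and threshold values here are made up):

        import numpy as np

        # Hypothetical per-topic thresholds; in practice these are tuned per topic.
        THRESHOLDS = {"order_status": 0.80, "shipping_cost": 0.85}
        DEFAULT_THRESHOLD = 0.75

        def classify(probabilities, topics):
            """Return the predicted topic, or None when the model isn't confident enough."""
            best = int(np.argmax(probabilities))
            topic = topics[best]
            if probabilities[best] < THRESHOLDS.get(topic, DEFAULT_THRESHOLD):
                return None  # treated like "other": ignored or routed for review
            return topic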

        Since this taxonomy was pretty large, we wanted to make sure that everyone interpreted it consistently. We held several training sessions with our annotation team to describe our classification project and philosophy. We divided the annotators into two groups so they could annotate the same set of messages using our taxonomy. This exercise had a two-fold benefit:

        1. It gave annotators first-hand experience using our in-house annotation tool.
        2. It allowed us to measure inter-annotator agreement.

        This process was time-consuming as we needed to do several rounds of exercises. But, the training led us to refine the taxonomy itself by eliminating inconsistencies, clarifying descriptions, adding additional examples, and adding or removing topics. It also gave us reassurance that the annotators were aligned on the task of classifying messages.

        Let The Annotation Begin

        Once we and the annotators felt that they were ready, the group began to annotate messages. We set up a Slack channel for everyone to collaborate and work through tricky messages as they arose. This allowed everyone to see the thought process used to arrive at a classification.

        During the preprocessing of training data, we discarded single-character messages and messages consisting of only emojis. During the annotation phase, we excluded other kinds of noise from our training data. Annotators also flagged content that wasn’t actually a message typed by a buyer, such as when a buyer cut-and-pastes the body of an email they’ve received from a Shopify store confirming their purchase. As the old saying goes, garbage in, garbage out. Lastly, due to our current scope and resource constraints, we had to set aside non-English messages.

        Handling Sensitive Information

        You might be wondering how we dealt with personal information (PI) like emails or phone numbers. PI occasionally shows up in buyer messages and we took special care to ensure that it was handled appropriately. This was a complicated, and at times manual, process that involved many steps and tools.

        To avoid training our machine learning model on any messages containing PI, we couldn’t just ignore them. That would likely bias our model. Instead, we wanted to identify the messages with PI, then replace it with realistic, mock data. In this way, we would have examples of real messages that wouldn’t be identifiable to any real person.

        This anonymization process began with our annotators flagging messages containing PI. Next, we used an open-source library called Presidio to analyze and anonymize the PI. This tool ran in our data warehouse, keeping our merchants’ data within Shopify’s systems. Presidio is able to recognize many different types of PI, and the anonymizer provides different kinds of operators that can transform the instances of PI into something else. For example, you could completely remove it, mask part of it, or replace it with something else.
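
        A minimal sketch of the analyze-then-anonymize steps, assuming placeholder replacement before Faker fills in realistic values (this isn’t the exact pipeline we ran in our data warehouse):

        from presidio_analyzer import AnalyzerEngine
        from presidio_anonymizer import AnonymizerEngine
        from presidio_anonymizer.entities import OperatorConfig

        message = "my phone is 852 5555 1234. Email is saharsingh@example.com"

        # Detect PI entities such as phone numbers and email addresses.
        analyzer = AnalyzerEngine()
        findings = analyzer.analyze(text=message, language="en")

        # Replace every detected entity with a placeholder; Faker fills these in later.
        anonymizer = AnonymizerEngine()
        result = anonymizer.anonymize(
            text=message,
            analyzer_results=findings,
            operators={"DEFAULT": OperatorConfig("replace", {"new_value": "<REDACTED>"})},
        )
        print(result.text)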

        In our case, we used another open-source tool called Faker to replace the PI. This library is customizable and localized, and its providers can generate realistic addresses, names, locations, URLs, and more. Here’s an example of its Python API:
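
        The snippet below is an illustrative sketch rather than the exact code we used; the locale is arbitrary:

        from faker import Faker

        fake = Faker("ja_JP")  # locale chosen only for illustration

        print(fake.name())          # a realistic, locale-appropriate name
        print(fake.phone_number())  # a locale-appropriate phone number
        print(fake.email())
        print(fake.address())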

        Combining Presidio and Faker allowed us to semi-automate the PI replacement; see below for a fabricated example:

        Original: can i pickup today? i ordered this am: Sahar Singh my phone is 852 5555 1234. Email is saharsingh@example.com

        Anonymized: can i pickup today? i ordered this am: Sahar Singh my phone is 090-722-7549. Email is osamusato@yoshida.jp

        If you’re a sharp-eyed reader, you’ll notice (as we did) that our tools missed identifying a bit of fabricated PI in the above example (hint: the name). Despite Presidio using a variety of techniques (regular expressions, named entity recognition, and checksums), some PI slipped through the cracks. Names and addresses have a lot of variability and are hard to recognize reliably. This meant that we still needed to inspect the before and after output to identify whether any PI was still present. Any remaining PI was manually replaced with a placeholder (for example, the name Sahar Singh was replaced with <PERSON>). Finally, we ran another script to replace the placeholders with Faker-generated data.

        A Little Help From The Trends

        Towards the end of our annotation project, we noticed a trend that persisted throughout the campaign: some topics in our taxonomy were overrepresented in the training data. It turns out that buyers ask a lot of questions about products!

        Our annotators had already gone through thousands of messages. We couldn’t afford to split up the topics with the most popular messages and re-classify them, but how could we ensure our model performed well on the minority classes? We needed to get more training examples from the underrepresented topics.

        Since we were continuously training a model on the labeled messages as they became available, we decided to use it to help us find additional messages. Using the model’s predictions, we excluded any messages classified with the overrepresented topics. The remaining examples belonged to the other topics, or were ones that the model was uncertain about. These messages were then manually labeled by our annotators.

        Results

        So, after all of this effort to create a high-quality, consistently labeled dataset, what was the outcome? How did it compare to our first prototype? Not bad at all. We achieved our goal of higher accuracy and coverage:

        Metric | Version 1.0 Prototype | Version 2.0 in Production
        Size of training set | 40,000 | 20,000
        Annotation strategy | Based on embedding similarity | Human labeled
        Taxonomy classes | 20 | 45
        Model accuracy | ~70% | ~90%
        High confidence coverage | ~35% | ~80%

        Another key part of our success was working collaboratively with diverse subject matter experts. Bringing in our support advisors, staff content designer, and product researcher provided perspectives and expertise that we as data scientists couldn’t achieve alone.

        While we shipped something we’re proud of, our work isn’t done. This is a living project that will require continued development. As trends and sentiments change over time, the topics of conversations happening in Shopify Inbox will shift accordingly. We’ll need to keep our taxonomy and training data up-to-date and create new models to continue to keep our standards high.

        If you want to learn more about the data work behind Shopify Inbox, check out Building a Real-time Buyer Signal Data Pipeline for Shopify Inbox that details the real-time buyer signal data pipeline we built.

        Eric Fung is a senior data scientist on the Messaging team at Shopify. He loves baking and will always pick something he’s never tried at a restaurant. Follow Eric on Twitter.

        Diego Castañeda is a senior data scientist on the Core Optimize Data team at Shopify. Previously, he was part of the Messaging team and helped create machine learning powered features for Inbox. He loves computers, astronomy and soccer. Connect with Diego on LinkedIn.


        Wherever you are, your next journey starts here! If building systems from the ground up to solve real-world problems interests you, our Engineering blog has stories about other challenges we have encountered. Intrigued? Visit our Data Science & Engineering career page to find out about our open positions. Learn about how we’re hiring to design the future together—a future that is Digital by Design.

        Continue reading

        Spin Infrastructure Adventures: Containers, Systemd, and CGroups

        Spin Infrastructure Adventures: Containers, Systemd, and CGroups

The Spin infrastructure team works hard at improving the stability of the system. In February 2022 we moved to Container Optimized OS (COS), the Google-maintained operating system for their Kubernetes Engine SaaS offering. A month later we turned on multi-cluster support to allow for increased scalability as more users came on board. Recently, we've dramatically increased the default resources allotted to instances. However, even with all these changes we're still experiencing some issues, and I wanted to dive a bit deeper into one of them in this post.

        Spin’s Basic Building Blocks

        First it's important to know the basic building blocks of Spin and how these systems interact. The Spin infrastructure is built on top of Kubernetes, using many of the same components that Shopify’s production applications use. Spin instances themselves are implemented via a custom resource controller that we install on the system during creation. Among other things, the controller transforms the Instance custom resource into a pod that’s booted from a special Isospin container image along with the configuration supplied during instance creation. Inside the container we utilize systemd as a process manager and workflow engine to initialize the environment, including installing dotfiles, pulling application source code, and running through bootstrap scripts. Systemd is vital because it enables a structured way to manage system initialization and this is used heavily by Spin.

There's definitely more to what makes Spin than what I've described, but at a high level, and for the purposes of understanding the technical challenges ahead, it's important to remember that:

        1. Spin is built on Kubernetes
        2. Instances are run in a container
        3. systemd is run INSIDE the container to manage the environment.

        First Encounter

In February 2022, we had a serious problem with Pod relocations that we eventually tracked down to node instability. We had several nodes in our Kubernetes clusters that would randomly fail and require either a reboot or to be replaced entirely. Google had decent automation for this that would catch nodes in a bad state and replace them automatically, but it was occurring often enough (five nodes per day, or about one percent of all nodes) that users began to notice. Through various discussions with Shopify's engineering infrastructure support team and Google Cloud support, we eventually homed in on memory consumption as the primary issue. Specifically, nodes were running out of memory and pods were being out of memory (OOM) killed as a result. At first, this didn't seem so suspicious: we gave users the ability to do whatever they want inside their containers and didn't provide them with many resources (8 to 12 GB of RAM each), so it was a natural assumption that containers were, rightfully, just using too many resources. However, we found some extra information that made us think otherwise.

        First, the containers being OOM killed would occasionally be the only Spin instance on the node and when we looked at their memory usage, often it would be below the memory limit allotted to them.

Second, in parallel to this, another engineer investigating a Kafka performance issue identified a healthy running instance using far more resources than should have been possible.

The first issue would eventually be connected to a memory leak that the host node was experiencing, and through some trial and error we found that switching the host OS from Ubuntu to Google's Container Optimized OS solved it. The second issue remained a mystery. With the rollout of COS, though, we saw a 100-fold reduction in OOM kills, which was sufficient for our goals, and we began to direct our attention to other priorities.

        Second Encounter

Fast forward a few months to May 2022. We were experiencing better stability, which was a source of relief for the Spin team. While our ATC rotations weren't significantly less frantic, the infrastructure team had the chance to roll out important improvements, including multi-cluster support and a whole new snapshotting process. Overall, things felt much better.

        Slowly but surely over the course of a few weeks, we started to see increased reports of instance instability. We verified that the nodes weren’t leaking memory as before, so it wasn’t a regression. This is when several team members re-discovered the excess memory usage issue we’d seen before, but this time we decided to dive a little further.

We needed a clean environment to do the analysis, so we set up a new Spin instance on its own node. During our test, we monitored the Pod's resource usage and the resource usage of the node it was running on, using kubectl top pod and kubectl top node. Before performing any tests, we recorded the baseline usage of both.

Next, we needed to simulate memory load inside of the container. We opted to use a tool called stress, which let us start a process that consumes a specified amount of memory to exercise the system.

We ran kubectl exec -it spin-muhc -- bash to land in a shell inside the container, and then stress -m 1 --vm-bytes 10G --vm-hang 0 to start the test.

We then checked the resource usage again.

This was great: exactly what we expected. The 10GB used by our stress test showed up in our metrics. Also, when we checked the cgroup assigned to the stress process (PID 24899), we saw it was correctly assigned to the Kubernetes Pod's cgroup hierarchy.

This looked great as well. Next, we performed the same test, but from the instance environment accessed via spin shell, and checked the resource usage again.

Now this was odd. The memory allocated by stress wasn't showing up under the Pod stats (still only 14Mi), but it was showing up for the node (33504Mi). Checking the usage from inside the container confirmed that stress was indeed holding onto the memory as expected.

However, when we checked the cgroup this time, we saw something new: the process had been placed in a different cgroup hierarchy entirely.

What the heck!? Why was the cgroup different? We double-checked that this was the correct hierarchy using the systemd cgroup listing tool from within the spin instance, and it confirmed what we were seeing.

        So to summarize what we had seen: 

1. When we run processes inside the container via kubectl exec, they're correctly placed within the kubepods cgroup hierarchy. This is the hierarchy that contains the pod's memory limits.
2. When we run the same processes inside the container via spin shell, they're placed within a cgroup hierarchy that doesn't contain the limits. We verified this by checking the cgroup's memory limit file directly.

The limit we found was close to the maximum value of a 64-bit integer (about 8.5 billion gigabytes of memory). Needless to say, our system has less than that, so the limit is effectively unlimited.
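If you want to check this yourself, a small Python sketch like the one below reports which memory cgroup a process landed in and the limit that applies to it. It assumes the cgroup v1 layout our nodes were using at the time (a memory controller mounted at /sys/fs/cgroup/memory); the paths differ on cgroup v2.

import sys

def memory_cgroup(pid):
    # Lines in /proc/<pid>/cgroup look like "<id>:<controllers>:<path>" on cgroup v1.
    with open(f"/proc/{pid}/cgroup") as f:
        for line in f:
            _, controllers, path = line.strip().split(":", 2)
            if "memory" in controllers.split(","):
                return path
    return None

def memory_limit_bytes(cgroup_path):
    # cgroup v1 exposes the limit via memory.limit_in_bytes under the memory hierarchy.
    with open(f"/sys/fs/cgroup/memory{cgroup_path}/memory.limit_in_bytes") as f:
        return int(f.read().strip())

if __name__ == "__main__":
    pid = sys.argv[1] if len(sys.argv) > 1 else "self"
    path = memory_cgroup(pid)
    print(f"memory cgroup: {path}")
    if path:
        print(f"memory.limit_in_bytes: {memory_limit_bytes(path)}")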

For practical purposes, this means any resource limitation we put on the Pod that runs a Spin instance isn't being honored. Spin instances can use more memory than they're allotted, which is concerning for a few reasons; most importantly, we depend on these limits to keep instances from interfering with one another.

        Isolating It

In a complex environment like Spin it's hard to account for everything that might be affecting the system. Sometimes it's best to distill problems down to the essential details to properly isolate the issue. We were able to reproduce the cgroup leak in a few different ways: first, on Spin instances directly, using crictl or ctr with custom arguments and real Spin instances; and second, in a local Docker environment. Setting up an experiment like this also allowed for much quicker iteration when testing potential fixes.

From these experiments we discovered differences in how the runtimes (containerd, Docker, and Podman) execute systemd containers. Podman, for instance, has a --systemd flag that enables or disables an integration with the host systemd. containerd has a similar flag, --runc-systemd-cgroup, that starts runc with the systemd cgroup manager. For Docker, however, no such integration exists (you can modify the cgroup manager via daemon.json, but not via the CLI like Podman and containerd), and we saw the same cgroup leakage. Comparing the cgroups assigned to the container processes under Docker and Podman made the difference clear.


        Podman placed the systemd and stress processes in a cgroup unique to the container. This allowed Podman to properly delegate the resource limitations to both systemd and any process that systemd spawns. This was the behavior we were looking for!

        The Fix

We now had an example of a systemd container being properly isolated from the host with Podman. The trouble was that our Spin production environments use Kubernetes, which uses containerd, not Podman, as its container runtime. So how could we leverage what we learned from Podman toward a solution?

While investigating the differences between Podman and Docker with respect to systemd, we came across the crux of the fix. By default, Docker and containerd use a cgroup driver called cgroupfs to manage the allocation of resources, while Podman uses the systemd driver (this is specific to our host operating system, COS from Google). The systemd driver delegates responsibility for cgroup management to the host systemd, which then properly manages the delegated systemd running in the container.

It's recommended that nodes running systemd on the host use the systemd cgroup driver by default; however, COS from Google is still set to use cgroupfs. Checking the developer release notes, we see that version 101 of COS mentions switching the default cgroup driver to systemd, so the fix is coming!

        What’s Next

Debugging this issue was an enlightening experience. If you had asked us before, "Is it possible for a container to use more resources than it's assigned?", we would have said no. But now that we understand more about how containers deliver the sandbox they provide, it's become clear the answer should have been, "It depends."

Ultimately, the escape from the Pod's limits came from us bind-mounting /sys/fs/cgroup read-only into the container. A subtle side effect of this is that while the directory itself isn't writable, all of its subdirectories are. But since this mount is required for systemd to even boot up, we don't have the option to remove it. There's a lot of ongoing work by the container community to get systemd to exist peacefully within containers, but for now we'll have to make do.

        Acknowledgements

        Special thanks to Daniel Walsh from RedHat for writing so much on the topic. And Josh Heinrichs from the Spin team for investigating the issue and discovering the fix.

        Additional Information

        Chris is an infrastructure engineer with a focus on developer platforms. He’s also a member of the ServiceMeshCon program committee and a @Linkerd ambassador.


        Wherever you are, your next journey starts here! If building systems from the ground up to solve real-world problems interests you, our Engineering blog has stories about other challenges we have encountered. Intrigued? Visit our Engineering career page to find out about our open positions and learn about Digital by Design.

        Continue reading

        Shopify and Open Source: A Mutually Beneficial Relationship

        Shopify and Open Source: A Mutually Beneficial Relationship

        Shopify and Rails have grown up together. Both were in their infancy in 2004, and our CEO (Tobi) was one of the first contributors and a member of Rails Core. Shopify was built on top of Rails, and our engineering culture is rooted in the Rails Doctrine, from developer happiness to the omakase menu, sharp knives, and majestic monoliths. We embody the doctrine pillars. 

        Shopify's success is due, in part, to Ruby and Rails. We feel obligated to pay that success back to the community as best we can. But our commitment and investment are about more than just paying off that debt; we have a more meaningful and mutually beneficial goal.

        One Hundred Year Mission

        At Shopify, we often talk about aspiring to be a 100-year company–to still be around in 2122! That feels like an ambitious dream, but we make decisions and build our code so that it scales as we grow with that goal in mind. If we pull that off, will Ruby and Rails still be our tech stack? It's hard to answer, but it's part of my job to think about that tech stack over the next 100 years.

        Ruby and Rails as 100-year tools? What does that even mean?

        To get to 100 years, Rails has to be more than an easy way to get started on a new project. It's about cost-effective performance in production, well-formed opinions on the application architecture, easy upgrades, great editors, avoiding antipatterns, and choosing when you want the benefits of typing. 

        To get to 100 years, Ruby and Rails have to merit being the tool of choice, every day, for large teams and well-aged projects for a hundred years. They have to be the tool of choice for thousands of developers, across millions of lines of code, handling billions of web requests. That's the vision. That's Rails at scale.

        And that scale is where Shopify is investing.

        Why Companies Should Invest In Open Source

        Open source is the heart and soul of Rails: I’d say that Rails would be nowhere near what it is today if not for the open source community.

        Rafael França, Shopify Principal Engineer and Rails Core Team Member

We invest in open source to build the most stable, resilient, performant version of Ruby and Rails to grow our applications on. How much better could it be if more people were contributing? As a community, we can do more. Ruby and Rails can only continue to be a choice for companies if we're actively investing in their development, and to do that, we need more companies involved in contributing.

        It Improves Engineering Skills

        Practice makes progress! Building open source software with cross-functional teams helps build better communication skills and offers opportunities to navigate feedback and criticism constructively. It also enables you to flex your debugging muscles and develop deep expertise in how the framework functions, which helps you build better, more stable applications for your company.

        It’s Essential to Application Health & Longevity

        Contributing to open source helps ensure that Rails benefits your application and the company in the long term. We contribute because we care about the changes and how they affect our applications. Investing upfront in the foundation is proactive, whereas rewrites and monkey patches are reactive and lead to brittle code that's hard to maintain and upgrade.

        At our scale, it's common to find issues with, or opportunities to enhance, the software we use. Why keep those improvements private? Because we build on open source software, it makes sense to contribute to those projects to ensure that they will be as great as possible for as long as possible. If we contribute to the community, it increases our influence on the software that our success is built on and helps improve our chances of becoming a 100-year company. This is why we make contributions to Ruby and Rails, and other open source projects. The commitment and investment are significant, but so are the benefits.

        How We're Investing in Ruby and Rails

        Shopify is built on a foundation of open source software, and we want to ensure that that foundation continues to thrive for years to come and that it continues to scale to meet our requirements. That foundation can’t succeed without investment and contribution from developers and companies. We don’t believe that open source development is “someone else’s problem”. We are committed to Ruby and Rails projects because the investment helps us future-proof our foundation and, therefore, Shopify. 

        We contribute to strategic projects and invest in initiatives that impact developer experience, performance, and security—not just for Shopify but for the greater community. Here are some projects we’re investing in:

        Improving Developer Tooling 

        • We’ve open-sourced projects like toxiproxy, bootsnap, packwerk, tapioca, paquito, and maintenance_tasks that are niche tools we found we needed. If we need them, other developers likely need them as well.
        • We helped add Rails support to Sorbet's gradual typing to make typing better for everyone.
• We're working to make Ruby support in VS Code best-in-class with pre-configured extensions and powerful features like refactorings.
        • We're working on automating upgrades between Ruby and Rails versions to reduce friction for developers.

        Increasing Performance

        Enhancing Security

        • We're actively contributing to bundler and rubygems to make Ruby's supply chain best-in-class.
        • We're partnering with Ruby Central to ensure the long-term success and security of Rubygems.org through strategic investments in engineering, security-related projects, critical tools and libraries, and improving the cycle time for contributors.

        Meet Shopify Contributors

        The biggest investment you can make is to be directly involved in the future of the tools that your company relies on. We believe we are all responsible for the sustainability and quality of open source. Shopify engineers are encouraged to contribute to open source projects where possible. The commitment varies. Some engineers make occasional contributions, some are part-time maintainers of important open source libraries that we depend on, and some are full-time contributors to critical open source projects.

        Meet some of the Shopify engineers contributing to open source. Some of those faces are probably familiar because we have some well-known experts on the team. But some you might not know…yet. We're growing the next generation of Ruby and Rails experts to build for the future.

        Mike is a NYC-based engineering leader who's worked in a variety of domains, including energy management systems, bond pricing, high-performance computing, agile consulting, and cloud computing platforms. He is an active member of the Ruby open-source community, where as a maintainer of a few popular libraries he occasionally still gets to write software. Mike has spent the past decade growing inclusive engineering organizations and delivering amazing software products for Pivotal, VMware, and Shopify.


        Wherever you are, your next journey starts here! If building systems from the ground up to solve real-world problems interests you, our Engineering blog has stories about other challenges we have encountered. Intrigued? Visit our Engineering career page to find out about our open positions and learn about Digital by Design.

        Continue reading

        The Story Behind Shopify’s Isospin Tooling

        The Story Behind Shopify’s Isospin Tooling

        You may have read that Shopify has built an in-house cloud development platform named Spin. In that post, we covered the history of the platform and how it powers our everyday work. In this post, we’ll take a deeper dive into one specific aspect of Spin: Isospin, Shopify’s systemd-based tooling that forms the core of how we run applications within Spin.

        The initial implementation of Spin used the time-honored POSS (Pile of Shell Scripts) design pattern. As we moved to a model where all of our applications ran in a single Linux VM, we were quickly outgrowing our tooling⁠—not to mention the added complexity of managing multiple applications within a single machine. Decisions such as what dependency services to run, in what part of the boot process, and how many copies to run became much more difficult as we ran many applications together within the same instance. Specifically, we needed a way to:

        • split up an application into its component parts
        • specify the dependencies between those parts
        • have those jobs be scheduled at the appropriate times
        • isolate services and processes from each other.

        At a certain point, stepping back, an obvious answer began to emerge. The needs we were describing weren’t merely solvable, they were already solved—by something we were already using. We were describing services, the same as any other services run by the OS. There were already tools to solve this built right into the OS. Why not leverage that?

        A Lightning Tour of systemd

systemd's service management works by dividing the system into a graph of units representing individual services or jobs. Each unit can specify its dependencies on other units in granular detail, allowing systemd to determine an order in which to launch services to bring the system up, and to reason about cascading failure states. In addition to units representing actual services, it supports targets, which represent an abstract grouping of one or more units. Targets can have dependencies of their own and be depended on by other units, but perform no actual work. By specifying targets representing phases of the boot process and a top-level target representing the desired state of the system, systemd can quickly and comprehensively prepare services to run.

        systemd has several features which enable dynamic generation of units. Since we were injecting multiple apps into a system at runtime, with varying dependencies and processes, we made heavy use of these features to enable us to create complex systemd service graphs on the fly.

        The first of these is template unit files. Ordinarily, systemd namespaces units via their names; any service named foo will satisfy a dependency on the service named foo, and only one instance of a unit with a name can be running at once. This was obviously not ideal for us, since we have many services that we’re running per-application. Template unit files expand this distinction a bit by allowing a service to take a parameter that becomes part of its namespace. For example, a service named foo@.service could take the argument bar, running as foo@bar. This allows multiple copies of the same service to run simultaneously. The parameter is also available within the unit as a variable, allowing us to namespace runtime directories and other values with the same parameter.

        Template units were key to us since not only do they allow us to share service definitions for applications themselves, they allow us to run multiple copies of dependency services. In order to maintain full isolation between applications—and to simulate the separately-networked services they would be talking to in production—neighbor apps within a single Isospin VM don’t use the same installation of core services such as MySQL or Elasticsearch. Instead, we run one copy of these services for each app that needs it. Template units simplified this process greatly and via a single service definition, we simply run as many copies of each as we need.

        We also made use of generators, a systemd feature that allows dynamically creating units at runtime. This was useful for us since the dynamic state of our system meant that a fixed service order wasn’t really feasible. There were two primary features of Isospin’s setup that complicated things:

        1. Which app or apps to run in the system isn’t fixed, but rather is assigned when we boot a system. Thus, via information assigned at the time the system is booted, we need to choose which top-level services to enable.

        2. While many of the Spin-specific services are run for every app, dependencies on other services are dynamic. Not every app requires MySQL or Elasticsearch or so on. We needed a way to specify these systemd-level dependencies dynamically.

Generators provided a simple way to handle this. Early in the bootup process, we run a generator that creates a target named spin-app for each app to be run in the system. That target contains all of the top-level dependencies an app needs to run, and is then assigned as a dependency of the "system is running" target. Despite sounding complex, this requires no more than a 28-line bash script and a simple template file for the service. Likewise, we're able to assign the appropriate dependency services as requirements of this spin-app target via another generator that runs later in the process.

        Booting Up an Example

        To help understand how this works in action, let’s walk through an example of the Isospin boot process.

We start by creating a target named spin.target, which we use to represent that Spin has finished booting. We'll use this target later to determine whether or not the system has successfully finished starting up. We then run a generator named apps that checks the configuration to see which apps we've specified for the system. It then generates new dependencies on the spin-app@ target, requesting one instance per application and passing in the name of the application as its parameter.

        spin-app@ depends on several of the core services that represent a fully available Spin application, including several more generators. Via those dependencies, we run the spin-svcs@ generator to determine which system-level service dependencies to inject, such as MySQL or Elasticsearch. We also run the spin-procs@ generator that determines which command or commands to run the application itself and generates one service per command.

        Finally, we bring the app up via the spin-init@ service and its dependencies. spin-init@ represents the final state of bootstrapping necessary for the application to be ready to run, and via its recursive dependencies systemd builds out the chain of processes necessary to clone an application’s source, run bootstrap processes, and then run any necessary finalizing tasks before it’s ready to run.

        Additional Tools (and Problems)

        Although the previously described tooling got us very far, we found that we had a few additional problems that required some additional tooling to fix.

        A problem we encountered under this new model was port collision between services. In the past, our apps were able to assume they were the only app in a service, so they could claim a common port for themselves without conflict. Although systemd gave us a lot of process isolation for free, this was a hole we’d dug for ourselves and one we’d need to get out of by ourselves too. 

The solution we settled on was simple but effective, and one that leveraged a few systemd features to simplify the process. We reasoned that port collision is only a problem because port selection was in the user's hands; we could solve it by making port assignment the OS's responsibility. We created a service that handles port assignment programmatically via a hashing process: by taking the service's name into account, we produce a semi-stable automated port assignment that avoids collision with any other port we've assigned on the system. This service can be declared as a dependency of any service that needs to bind to a port. It writes the generated port to an environment file, which systemd can use to inject environment variables into another service. As long as we specify this as a dependency, we can ensure that the dependent service receives a PORT variable that it's meant to respect and bind to.
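As a rough illustration of the idea (not Spin's actual implementation), hashing a service name into a reserved range might look like the Python below. The range boundaries and service name are made up, and a real version would also need to handle the occasional collision.

import hashlib

PORT_RANGE_START = 20000   # hypothetical range reserved for assigned ports
PORT_RANGE_SIZE = 10000

def assign_port(service_name: str) -> int:
    # Hashing the service name gives the same service the same port every time.
    digest = hashlib.sha256(service_name.encode("utf-8")).digest()
    offset = int.from_bytes(digest[:4], "big") % PORT_RANGE_SIZE
    return PORT_RANGE_START + offset

# The result can be written to an environment file that systemd injects into
# the dependent service, which then binds to $PORT.
print(f"PORT={assign_port('web@myapp')}")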

Another feature that came in handy is systemd's concept of service readiness. Many process runners, including the Foreman-based solutions we'd been using in the past, have a binary concept of service readiness: either a process is running, or it isn't, and if a process exits unexpectedly it's considered failed.

        systemd has the same model by default, but it also supports something more complex: it allows configuring a notify socket that allows an application to explicitly communicate its readiness. systemd exposes a Unix datagram socket to the service it’s running via the NOTIFY_SOCKET environment variable. When the underlying app has finished starting up and is ready, it can communicate that status via writing a message to the socket. This granularity helps avoid some of the rare but annoying gotchas with a more simple model of service readiness. It ensures that the service is only considered ready to accept connections when it's actually ready, avoiding a scenario in which external services try sending messages during the startup window. It also avoids a situation where the process remains running but the underlying service has failed during startup.

        Some of the external services we depend on use this, such as MySQL, but we also wrote our own tooling to incorporate it. Our notify-port script is a thin wrapper around web applications that monitors whether the service we’re wrapping has begun accepting HTTP connections over the port Isospin has assigned to it. By polling the port and notifying systemd when it comes up, we’ve been able to catch many real world bugs where services were waiting on the wrong port, and situations in which a server failed on startup while leaving the process alive.
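The notify protocol itself is small enough to sketch. The Python below polls a port and then writes READY=1 to the datagram socket systemd exposes via NOTIFY_SOCKET. It's a simplified stand-in for our notify-port wrapper, and the PORT environment variable is assumed to be the one the port-assignment service provided.

import os
import socket
import time

def wait_for_port(port, host="127.0.0.1"):
    # Poll until something is accepting TCP connections on the port.
    while True:
        try:
            with socket.create_connection((host, port), timeout=0.5):
                return
        except OSError:
            time.sleep(0.5)

def notify_ready():
    # systemd's notify protocol: send READY=1 to the socket named by NOTIFY_SOCKET.
    address = os.environ.get("NOTIFY_SOCKET")
    if not address:
        return
    if address.startswith("@"):
        address = "\0" + address[1:]  # abstract namespace socket
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.connect(address)
        sock.send(b"READY=1")

if __name__ == "__main__":
    wait_for_port(int(os.environ["PORT"]))  # PORT assumed to come from the port-assignment service
    notify_ready()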

        Isospin on Top

        Although we started out with some relatively simple goals, the more we worked with systemd the more we found ourselves able to leverage its tools to our advantage. By building Isospin on top of systemd, we found ourselves able to save time by reusing pre-existing structures that suited our needs and took advantage of sophisticated tooling for expressing service interdependency and service health. 

Going forward, we plan to continue expanding on Isospin to express more complex service relationships. For example, we're investigating the use of systemd service dependencies to allow teams to express that certain parts of their application rely on another team's application being available.

        Misty De Méo is an engineer who specializes in developer tooling. She’s been a maintainer of the Homebrew package manager since 2011.


        Wherever you are, your next journey starts here! If building systems from the ground up to solve real-world problems interests you, our Engineering blog has stories about other challenges we have encountered. Intrigued? Visit our Engineering career page to find out about our open positions and learn about Digital by Design.

        Continue reading

        How to Build Trust as a New Manager in a Fully Remote Team

        How to Build Trust as a New Manager in a Fully Remote Team

I had been in the same engineering org for seven years before the pandemic hit. We were a highly collaborative, co-located company, and the team was very used to brainstorming and working on physical whiteboards and shared workspaces. When the pandemic did hit, we moved pretty seamlessly from working together in our office to working from our homes. We didn't pay too much attention to designing a Digital First culture, and neither did we alter our ways of working dramatically.

        It was only when I joined Shopify last September that I began to realize that working remotely in a company where you have already established trust is very different from starting in a fully remote space and building trust from the ground up. 

        What Is Different About Starting Remotely?

So, what changes? The one word that comes to mind is intentionality. I would define intentionality as "the act of thoughtfully designing interactions or processes." A lot of things that happen seamlessly and organically in a real-life setting take more intentionality in a remote setting. If you deconstruct the process of building trust, you'll find that in a physical setting trust is built in active ways (the words you speak, your actions, and your expertise), but also in passive ways (your body language, demeanor, and casual water cooler talk). In a remote setting, it's much more difficult to observe others, and also to build casual, non-transactional relationships with people, unless you're intentional about it.

        Also, since you’re represented a lot more through your active voice, it’s important to work on setting up a new way of working and gaining mastery over the set of tools and skills that will help build trust and create success in your new environment.

        The 90-Day Plan

        The 90-Day Plan is named after the famous book The First 90 Days written by Michael D. Watkins. Essentially, it breaks down your onboarding journey into three steps:

        1. First 30 days: focus on your environment
        2. First 60 days: focus on your team
        3. 90 days and beyond: focus on yourself.

        First 30 Days: Focus on Your Environment

        Take the time out to think about what kind of workplace you want to create and also to reach out and understand the tone of the wider organization that you are part of.

        Study the Building 

When you start a new job in a physical location, it's common to study the office and understand the layout of not only the building, but also the company itself. When beginning work remotely, I suggest you start with a metaphorical study of the building. Try to understand the wider context of the organization and the people in it. You can do this with a mix of pairing sessions, one-on-ones, and peer group sessions. These processes help you gain technical and organizational context and also build relationships with peers.

        Set Up the Right Tools

In an office, many details of workplace setup are abstracted away from you. In a fully digital environment, you need to pay attention to setting your workspace up for success. There are plenty of materials available on how to set up your home office (like on lifehack.org and nextplane.net). Ensure that you take the time to set up your remote tools to your taste.

        Build Relationships 

        If you’re remote, it’s easy to be transactional with people outside of your immediate organization. However, it’s much more fun and rewarding to take the time to build relationships with people from different backgrounds across the company. It gives you a wider context of what the company is doing and the different challenges and opportunities.

        First 60 Days: Focus on Your Team

        Use asynchronous communication for productivity and synchronous for connection.

        Establish Connection and Trust

        When you start leading a remote team, the first thing to do is establish connection and trust. You do this by spending a lot of your time in the initial days meeting your team members and understanding their aspirations and expectations. You should also, if possible, attempt to meet the team once in real life within a few months of starting. 

        Meet in Real Life

Meeting in real life will help you form deep human relationships with your team members and understand them beyond the limited scope of workplace transactions. Once you've done this, ensure that you create a mix of synchronous and asynchronous processes within your team. Examples of asynchronous processes are automated dailies, code reviews, receiving feedback, and collaboration on technical design documents. We use synchronous meetings for team retros, coffee sessions, demos, and planning sessions. Try to maximize async productivity and be intentional about the times that you do come together as a team.

        Establishing Psychological Safety

The important thing about leading a team remotely is to firmly establish a strong culture of psychological safety. Psychological safety in the workplace is important, not only for teams to feel engaged, but also for members to thrive. While it might be trickier to establish psychological safety remotely, it's definitely possible. Some ways to do it:

        1. Default to open communication wherever possible.
        2. Engage people to speak about issues openly during retros and all-hands meetings.
        3. Be transparent about things that have not worked well for you. Setting this example will help people open themselves up to be vulnerable with their teams.

First 90 Days: Focus on Yourself

        How do you manage and moderate your own emotions as you find your way in a new organization with this new way of working?

        FOMO Is Real

        Starting in a new place is nerve wracking. Starting while fully remote can be a lonely exercise. Working in a global company like Shopify means that you need to get used to the fact that work is always happening in some part of the globe. It’s easy to get overwhelmed and be always on. While FOMO can be very real, be aware of all the new information that you’re ingesting and take the time to reflect upon it.

        Design Your Workday

        Remote work means you’re not chained to your nine-to-five routine anymore. Reflect on the possibilities this offers you and think about how you want to design your workday. Maybe you want meeting free times to walk the dog, hit the gym, or take a power nap. Ensure you think about how to structure the day in a way that suits your life and plan your agenda accordingly.

        Try New Things

        It’s pretty intense in the first few months as you try ways to establish trust and build a strong team together. Not everything you try will take and not everything will work. The important thing is to be clear with what you’re setting out to achieve, collect feedback on what works and doesn’t, learn from the experience, and move forward.

        Being able to work in a remote work environment is both rewarding and fun. It’s definitely a new superpower that, if used well, leads to rich and absorbing experiences. The first 90 days are just the beginning of this journey. Sit back, tighten your seatbelt and get ready for a joyride of learning and growth.

        Sadhana is an engineer, book nerd, constant learner and enabler of people. Worked in the industry for more than 20 years in various roles and tech stacks. Agile Enthusiast. You can connect with Sadhana on LinkedIn and Medium.


        Wherever you are, your next journey starts here! If building systems from the ground up to solve real-world problems interests you, our Engineering blog has stories about other challenges we have encountered. Intrigued? Visit our Engineering career page to find out about our open positions and learn about Digital by Design.

        Continue reading

        Introducing ShopifyQL: Our New Commerce Data Querying Language 

        Introducing ShopifyQL: Our New Commerce Data Querying Language 

        At Shopify, we recognize the positive impact data-informed decisions have on the growth of a business. But we also recognize that data exploration is gated to those without a data science or coding background. To make it easier for our merchants to inform their decisions with data, we built an accessible, commerce-focused querying language. We call it ShopifyQL. ShopifyQL enables Shopify Plus merchants to explore their data with powerful features like easy to learn syntax, one-step data visualization, built-in period comparisons, and commerce-specific date functions. 

        I’ll discuss how ShopifyQL makes data exploration more accessible, then dive into the commerce-specific features we built into the language, and walk you through some query examples.

        Why We Built ShopifyQL

        As data scientists, engineers, and developers, we know that data is a key factor in business decisions across all industries. This is especially true for businesses that have achieved product market fit, where optimization decisions are more frequent. Now, commerce is a broad industry and the application of data is deeply personal to the context of an individual business, which is why we know it’s important that our merchants be able to explore their data in an accessible way.

        Standard dashboards offer a good solution for monitoring key metrics, while interactive reports with drill-down options allow deeper dives into understanding how those key metrics move. However, reports and dashboards help merchants understand what happened, but not why it happened. Often, merchants require custom data exploration to understand the why of a problem, or to investigate how different parts of the business were impacted by a set of decisions. For this, they turn to their data teams (if they have them) and the underlying data.

Historically, our Shopify Plus merchants with data teams have employed a centralized approach in which data teams support multiple teams across the business. This strategy helps them maximize their data capability, but it means constantly prioritizing among data stakeholders in the business. Unfortunately, this leaves teams in constant competition for their data needs. Financial deep dives are prioritized over operational decision support. This leaves marketing, merchandising, fulfillment, inventory, and operations to fend for themselves. They're then forced to either make decisions with the standard reports and dashboards available to them, or do their own custom data exploration (often in spreadsheets). Most often they end up in the worst-case scenario: relying on their gut and leaving data out of the decision-making process.

        Going past the reports and dashboards into the underlying datasets that drive them is guarded by complex data engineering concepts and languages like SQL. The basics of traditional data querying languages are easy to learn. However, applying querying languages to datasets requires experience with, and knowledge of, the entire data lifecycle (from data capture to data modeling). In some cases, simple commerce-specific data explorations like year-over-year sales require a more complicated query than the basic pattern of selecting data from some table with some filter. This isn’t a core competency of our average merchant. They get shut out from the data exploration process and the ability to inform their decisions with insights gleaned from custom data explorations. That’s why we built ShopifyQL.

        A Data Querying Language Built for Commerce

        We understand that merchants know their business the best and want to put the power of their data into their hands. Data-informed decision making is at the heart of every successful business, and with ShopifyQL we’re empowering Shopify Plus merchants to gain insights at every level of data analysis. 

        With our new data querying language, ShopifyQL, Shopify Plus merchants can easily query their online store data. ShopifyQL makes commerce data exploration accessible to non-technical users by simplifying traditional aspects of data querying like:

        • Building visualizations directly from the query, without having to manipulate data with additional tools.
        • Creating year-over-year analysis with one simple statement, instead of writing complicated SQL joins.
        • Referencing known commerce date ranges (For example, Black Friday), without having to remember the exact dates.
        • Accessing data specifically modeled for commerce exploration purposes, without having to connect the dots across different data sources. 

        Intuitive Syntax That Makes Data Exploration Easy

        The ShopifyQL syntax is designed to simplify the traditional complexities of data querying languages like SQL. The general syntax tree follows a familiar querying structure:

        FROM {table_name}
        SHOW|VISUALIZE {column1, column2,...} 
        TYPE {visualization_type}
        AS {alias1,alias2,...}
        BY {dimension|date}
        WHERE {condition}
        SINCE {date_offset}
        UNTIL {date_offset}
        ORDER BY {column} ASC|DESC
        COMPARE TO {date_offset}
        LIMIT {number}

        We kept some of the fundamentals of the traditional querying concepts because we believe these are the bedrock of any querying language:

        • FROM: choose the data table you want to query
        • SELECT: we changed the wording to SHOW because we believe that data needs to be seen to be understood. The behavior of the function remains the same: choose the fields you want to include in your query
        • GROUP BY: shortened to BY. Choose how you want to aggregate your metrics
        • WHERE: filter the query results
        • ORDER BY: customize the sorting of the query results
        • LIMIT: specify the number of rows returned by the query.

        On top of these foundations, we wanted to bring a commerce-centric view to querying data. Here’s what we are making available via Shopify today.

        1. Start with the context of the dataset before selecting dimensions or metrics

We moved FROM to precede SHOW because it's more intuitive for users to select the dataset they care about first and then the fields. When you want to know conversion rates, it's natural to think about the product first and then its conversion rates; that's why we swapped the order of FROM and SHOW compared to traditional querying languages.

        2. Visualize the results directly from the query

        Charts are one of the most effective ways of exploring data, and VISUALIZE aims to simplify this process. Most query languages and querying interfaces return data in tabular format and place the burden of visualizing that data on the end user. This means using multiple tools, manual steps, and copy pasting. The VISUALIZE keyword allows Shopify Plus merchants to display their data in a chart or graph visualization directly from a query. For example, if you’re looking to identify trends in multiple sales metrics for a particular product category:

        A screenshot showing the ShopifyQL code at the top of the screen and a line chart that uses VISUALIZE to chart monthly total and gross sales
        Using VISUALIZE to chart monthly total and gross sales

We've made the querying process simpler by introducing smart defaults that allow you to get the same output with fewer lines of code. The query from above can also be written as:

        FROM sales
        VISUALIZE total_sales, gross_sales
        BY month
WHERE product_category = 'Shoes'
        SINCE -13m

        The query and the output relationship remains explicit, but the user is able to get to the result much faster.

        The following language features are currently being worked on, and will be available later this year:

        3. Period comparisons are native to the ShopifyQL experience

        Whether it’s year-over-year, month-over-month, or a custom date range, period comparison analyses are a staple in commerce analytics. With traditional querying languages, you either have to model a dataset to contain these comparisons as their own entries or write more complex queries that include window functions, common table expressions, or self joins. We’ve simplified that to a single statement. The COMPARE TO keyword allows ShopifyQL users to effortlessly perform period-over-period analysis. For instance, comparing this week’s sales data to last week:

        A screenshot showing the ShopifyQL code at the top of the screen and a line chart that uses VISUALIZE for comparing total sales between 2 time periods with COMPARE TO
        Comparing total sales between 2 time periods with COMPARE TO

        This powerful feature makes period-over-period exploration simpler and faster; no need to learn joins or window functions. Future development will enable multiple comparison periods for added functionality.

        4. Commerce specific date ranges simplify time period filtering

        Commerce-specific date ranges (for example Black Friday Cyber Monday, Christmas Holidays, or Easter) involve a manual lookup or a join to some holiday dataset. With ShopifyQL, we take care of the manual aspects of filtering for these date ranges and let the user focus on the analysis.

        The DURING statement, in conjunction with Shopify provided date ranges, allows ShopifyQL users to filter their query results by commerce-specific date ranges. For example, finding out what the top five selling products were during BFCM in 2021 versus 2020:

        A screenshot showing the ShopifyQL code at the top of the screen and a table that shows Product Title, Total Sales BFCM 2021, and Total Sales BFCM 2019
        Using DURING to simplify querying BFCM date ranges

        Future development will allow users to save their own date ranges unique to their business, giving them even more flexibility when exploring data for specific time periods.

        Check out our full list of current ShopifyQL features and language docs at shopify.dev.

        Data Models That Simplify Commerce-Specific Analysis and Explorations

        ShopifyQL allows us to access data models that address commerce-specific use cases and abstract the complexities of data transformation. Traditionally, businesses trade off SQL query simplicity for functionality, which limits users’ ability to perform deep dives and explorations. Since they can’t customize the functionality of SQL, their only lever is data modeling. For example, if you want to make data exploration more accessible to business users via simple SQL, you have to either create one flat table that aggregates across all data sources, or a number of use case specific tables. While this approach is useful in answering simple business questions, users looking to dig deeper would have to write more complex queries to either join across multiple tables, leverage window functions and common table expressions, or use the raw data and SQL to create their own models. 

        Alongside ShopifyQL we’re building exploration data models that are able to answer questions across the entire spectrum of commerce: products, orders, and customers. Each model focuses on the necessary dimensions and metrics to enable data exploration associated with that domain. For example, our product exploration dataset allows users to explore all aspects of product sales such as conversion, returns, inventory, etc. The following characteristics allow us to keep these data model designs simple while maximizing the functionality of ShopifyQL:

• Single flat tables aggregated to the lowest domain dimension grain and time attribute. There is no need for complicated joins, common table expressions, or window functions. Each table contains the necessary metrics that describe that domain's interaction across the entire business, regardless of where the data is coming from (for example, product pageviews and inventory are product concerns from different business processes).
        • All metrics are fully additive across all dimensions. Users are able to leverage the ShopifyQL aggregation functions without worrying about which dimensions are conformed. This also makes table schemas relatable to spreadsheets, and easy to understand for business users with no experience in data modeling practices.
        • Datasets support overlapping use cases. Users can calculate key metrics like total sales in multiple exploration datasets, whether the focus is on products, orders, or customers. This allows users to reconcile their work and gives them confidence in the queries they write.

        Without the leverage of creating our own querying language, the characteristics above would require complex queries which would limit data exploration and analysis.
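To illustrate what full additivity buys you, here's a toy Python example using pandas. The column names and values are made up and much simpler than the real exploration datasets, but the point carries: any rollup is a plain group-and-sum, with no joins or window functions.

import pandas as pd

# A toy stand-in for a flat, fully additive product exploration table.
products = pd.DataFrame([
    {"month": "2022-01", "product_title": "Shoes",  "total_sales": 1200.0, "net_quantity": 40},
    {"month": "2022-01", "product_title": "Shirts", "total_sales":  800.0, "net_quantity": 55},
    {"month": "2022-02", "product_title": "Shoes",  "total_sales": 1500.0, "net_quantity": 48},
    {"month": "2022-02", "product_title": "Shirts", "total_sales":  650.0, "net_quantity": 41},
])

# Because every metric is additive across every dimension, any aggregation is
# just a group-and-sum, much like working in a spreadsheet.
print(products.groupby("month")[["total_sales", "net_quantity"]].sum())
print(products.groupby("product_title")[["total_sales", "net_quantity"]].sum())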

        ShopifyQL Is a Foundational Piece of Our Platform

        We built ShopifyQL for our Shopify Plus merchants, third-party developer partners, and ourselves as a way to serve merchant-facing commerce analytics. 

        Merchants can access ShopifyQL via our new first party app ShopifyQL Notebooks

        We used the ShopifyQL APIs to build an app that allows our Shopify Plus merchants to write ShopifyQL queries inside a traditional notebooks experience. The notebooks app gives users the ultimate freedom of exploring their data, performing deep dives, and creating comprehensive data stories. 

        ShopifyQL APIs enable our partners to easily develop analytics apps

        The Shopify platform allows third-party developers to build apps that enable merchants to fully customize their Shopify experience. We’ve built GraphQL endpoints for access to ShopifyQL and the underlying datasets. Developers can leverage these APIs to submit ShopifyQL queries and return the resulting data in the API response. This allows our developer partners to save time and resources by querying modeled data. For more information about our GraphQL API, check out our API documentation.
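For a sense of the shape of such an integration (not a copy-paste recipe), a third-party app might submit a ShopifyQL query over the Admin GraphQL API roughly like the Python sketch below. The shop domain, API version, token, GraphQL field name, and selected response fields are all assumptions here; the actual schema, field names, and required access scopes are documented at shopify.dev.

import requests

SHOP = "your-store.myshopify.com"   # assumed shop domain
API_VERSION = "unstable"            # assumed; check shopify.dev for the version that exposes ShopifyQL
ACCESS_TOKEN = "shpat_xxx"          # Admin API access token for an app with the right scopes

# The ShopifyQL query is passed as a string argument to a GraphQL field;
# the field name and response selection below are illustrative placeholders.
GRAPHQL_DOCUMENT = """
{
  shopifyqlQuery(query: "FROM sales SHOW total_sales BY month SINCE -13m") {
    __typename
  }
}
"""

response = requests.post(
    f"https://{SHOP}/admin/api/{API_VERSION}/graphql.json",
    json={"query": GRAPHQL_DOCUMENT},
    headers={"X-Shopify-Access-Token": ACCESS_TOKEN, "Content-Type": "application/json"},
)
print(response.json())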

        ShopifyQL will power all analytical experiences on the Shopify platform

        We believe ShopifyQL can address all commerce analytics use cases. Our internal teams are going to leverage ShopifyQL to power the analytical experiences we create in the Shopify Admin—the online backend where merchants manage their stores. This helps us standardize our merchant-facing analytics interfaces across the business. Since we’re also the users of the language, we’re acutely aware of its gaps, and can make changes more quickly.

Looking Ahead

        We’re planning new language features designed to make querying with ShopifyQL even simpler and more powerful:

• More visualizations: Line and bar charts are great, but we want to provide more visualization options that help users discover different insights. New visualizations on the roadmap include dual-axis charts, funnels, annotations, scatter plots, and donut charts.
        • Pivoting: Pivoting data with a traditional SQL query is a complicated endeavor. We will simplify this with the capability to break down a metric by dimensional attributes in a columnar fashion. This will allow for charting trends of dimensional attributes across time for specific metrics with one simple query.
        • Aggregate conditions: Akin to a HAVING statement in SQL, we are building the capability for users to filter their queries on an aggregate condition. Unlike SQL, we’re going to allow for this pattern in the WHERE clause, removing the need for additional language syntax and keyword ordering complexity.

        As we continue to evolve ShopifyQL, our focus will remain on making commerce analytics more accessible to those looking to inform their decisions with data. We’ll continue to empower our developer partners to build comprehensive analytics apps, enable our merchants to make the most out of their data, and support our internal teams with powering their merchant-facing analytical use cases.

        Ranko is a product manager working on ShopifyQL and data products at Shopify. He's passionate about making data informed decisions more accessible to merchants.


Are you passionate about solving data problems and eager to learn more? We're always hiring! Reach out to us or apply on our careers page.

        Continue reading
