The Management Poles of Developer Infrastructure Teams


Over the past few years, as I’ve been managing multiple developer infrastructure teams at once, I’ve run into some tensions that are hard to resolve. In my current mental model, there are three poles with a natural tension between them that are thus tricky to balance: management support, system and domain expertise, and road maps. I’m going to discuss the details of these poles and some strategies I’ve tried for managing them.

What’s Special About Developer Infrastructure Teams?

Although this model likely can apply to any software development team, the nature of developer infrastructure (Dev Infra) makes this situation particularly acute for managers in our field. These are some of the specific challenges faced in Dev Infra:

  • Engineering managers have a lot on their plates. For whatever reason, infrastructure teams usually lack dedicated product managers, so we often have to step in to fill that gap. Similarly, we’re responsible for tasks that usually fall to UX experts, such as doing user research.
  • There’s a lot of maintenance and support. Teams are responsible for keeping multiple critical systems online with hundreds or thousands of users, usually with only six to eight developers. In addition, we often get a lot of support requests, which is part of the cost of developing in-house software that has no extended community outside the company.
  • As teams tend to organize around particular phases in the development workflow, or sometimes around specific technologies, their members develop a high degree of domain expertise over time. This expertise allows a team to improve its systems and informs its road map.

What Are The Three Poles?

The Dev Infra management poles I’ve modelled are tensions, much like that between product and engineering. They can’t, I don’t believe, all be solved at the same time—and perhaps they shouldn’t be. We, Dev Infra managers, balance them according to current needs and context and adapt as necessary. For this balancing act, it behooves us to make sure we understand the nature of these poles.

1. Management Support

Supporting developers in their career growth is an important function of any engineering manager. Direct involvement in team projects allows the tightest feedback loops between manager and report, and thus the highest-quality coaching and mentorship. We also want to maximize the number of reports per manager. Good managers are hard to find, and even the best manager adds a bit of overhead to a team’s impact.

We want the manager to be as involved in their reports’ work as possible, and we want the highest number of reports per manager that they can handle. Where this gets complicated is balancing the scope and domain of individual Dev Infra teams and of the whole Dev Infra organization. This tension is a direct result of the need for specific system and domain expertise on Dev Infra teams.

2. System and Domain Expertise

As mentioned above, in Dev Infra we tend to build teams around domains that represent phases in the development workflow, or occasionally around specific critical technologies. It’s important that each team has both domain knowledge and expertise in the specific systems involved. Despite this focus, the scope of and opportunities in a given area can be quite broad, and the associated systems grow in size and complexity.

Expertise in a team’s systems is crucial just to keep everything humming along. As with any long-running software application, dependencies need to be managed, underlying infrastructure has to be occasionally migrated, and incidents must be investigated and root causes solved. Furthermore, at any large organization, Dev Infra services can have many users relative to the size of the teams responsible for them. Some teams will require on-call schedules in case a critical system breaks during an emergency (finding out the deployment tool is down when you’re trying to ship a security fix is, let’s just say, not a great experience for anyone).

A larger team means less individual on-call time and more hands for support, maintenance, and project work. As teams expand their domain knowledge, more opportunities are discovered for increasing the impact of the team’s services. The team will naturally be driven to constantly improve the developer experience in their area of expertise. This drive, however, risks a disconnect with the greatest opportunities for impact across Dev Infra as a whole.

3. Road Maps

Specializing Dev Infra teams in particular domains is crucial for both maintenance and future investments. Team road maps and visions improve and expand upon existing offerings: smoothing interfaces, expanding functionality, scaling up existing solutions, and looking for new opportunities to impact development in their domain. They can make a big difference to developers during particular phases of their workflow: providing automation and feedback while writing code, speeding up continuous integration (CI) execution, avoiding deployment backlogs, and monitoring services more effectively.

As a whole Dev Infra department, however, the biggest impact we can have on development at any given time changes. When Dev Infra teams are first created, there’s usually a lot of low-hanging fruit—obvious friction at different points in the development workflow—so multiple teams can broadly improve the developer experience in parallel. At some point, however, some aspects of the workflow will be much smoother than others. Maybe CI times have finally dropped to five minutes. Maybe deploys rarely need attention after being initiated. At a large organization, there will always be edge cases, bugs, and special requirements in every area, but their impact will be increasingly limited when compared to the needs of the engineering department as a whole.

At this point, there may be an opportunity for a large new initiative that will radically impact development in a particular way. There may be a few, but it’s unlikely that there will be the need for radical changes across all domains. Furthermore, there may be unexplored opportunities and domains for which no team has been assembled. These can be hard to spot if the majority of developers and managers are focused on existing well-defined scopes.

How to Maintain the Balancing Act

Here’s the part where I confess that I don’t have a single amazing solution to balance management support, system maintenance and expertise, and high-level goals. Likely there are a variety of solutions that can be applied and none are perfect. Here are three ideas I’ve thought about and experimented with.

1. Temporarily Assign People from One Team to a Project on Another

If leadership has decided that the best impact for our organization at this moment is concentrated in the work of a particular team, call it Team A, and if Team A’s manager can’t effectively handle any more reports, then a direct way to get more stuff done is to take a few people from another team (Team B) and assign them to Team A’s projects. This has some other benefits as well: it increases the number of people familiar with Team A’s systems, and people sometimes like to change up what they’re working on.

When we tried this, the immediate question was “should the people on loan to Team A stay on the support rotations for their ‘home’ team?” From a technical expertise view, they’re important to keep the lights on in the systems they’re familiar with. Leaving them on such rotations prevents total focus on Team A, however, and at a minimum extends the onboarding time. There are a few factors to consider: the length of the project(s), the size of Team B, and the existing maintenance burden on Team B. Favour removing the reassigned people from their home rotations, but know that this will slow down Team B’s remaining work even more as the rest of the team picks up the extra rotations.

The other problem we ran into is that the manager of Team B is disconnected from the work their reassigned reports are now doing. Because the main problem is that Team A’s manager doesn’t have the bandwidth for more reports, there’s less management support for the people on loan, in terms of mentoring, performance management, and prioritization. The individual contributor (IC) can end up feeling disconnected from both their home team and the new one.

2. Have a Whole Team Contribute to Another Team’s Goals

We can mitigate at least the problem of ICs feeling isolated in their new team if we have the entire team (continuing the above nomenclature, Team B) work on the systems that another team (Team A) owns. This allows members of Team B to leverage their existing working relationships with each other, and Team B’s manager doesn’t have to split their attention between two teams. This arrangement can work well if there is a focused project in Team A’s domain that somehow involves some of Team B’s domain expertise.

This is, of course, a very blunt instrument, in that no project work will get done on Team B’s systems, which themselves still need to be maintained. There’s also a risk of demotivating the members of Team B, who may feel that their domain and systems aren’t important, although this can be mitigated to some extent if the project benefits or requires their domain expertise. We’ve had success here in exactly that way in an ongoing project done by our Test Infrastructure team to add data from our CI systems to Services DB, our application-catalog app stewarded by another team, Production Excellence. Their domain expertise allowed them to understand how to expose the data in the most intuitive and useful way, and they were able to more rapidly learn Services DB’s codebase by working together.

3. Tiger Team

A third option we’ve tried out in Dev Infra is a tiger team: “a specialized, cross-functional team brought together to solve or investigate a specific problem or critical issue.” People from multiple teams form a new, temporary team for a single project, often prototyping a new idea. Usually the team operates in a fast-paced, autonomous way towards a very specific goal, so management oversight is fairly limited. By definition, most people on a tiger team don’t usually work together, so the home and new team dichotomy is sidestepped, or at least very deliberately managed. The focus of the team means that members put aside maintenance, support, and other duties from their home team for the duration of the team’s existence.

The very first proof of concept for Spin was built this way over about a month. At that time, the value was sufficiently clear that we then formed a whole team around Spin and staffed it up to tackle the challenge of turning it into a proper product. We’ve learned a lot since then, but that first prototype was crucial in getting the whole project off the ground!

No Perfect Solutions

After a decade of management experience spent thinking about and experimenting with team structures, I haven’t found a perfect solution to balancing the three poles of management support, system maintenance and domain expertise, and high-level goals. Each situation is unique, and trade-offs have to be judged and taken deliberately. I would love to hear other stories of such balancing acts! Find me on Twitter and LinkedIn.

Mark Côté is the Director of Engineering, Developer Infrastructure, at Shopify. He's been in the developer-experience space for over a decade and loves thinking about infrastructure-as-product and using mental models in his strategies.

Wherever you are, your next journey starts here! If building systems from the ground up to solve real-world problems interests you, our Engineering blog has stories about other challenges we have encountered. Intrigued? Visit our Engineering career page to find out about our open positions and learn about Digital by Design.


Hubble: Our Tool for Encapsulating and Extending Security Tools


Fundamentally, Shopify is a company that thrives by building simplicity. We take hard, risky, and complex things and make them easy, safe, and simple.

Trust is Shopify’s team responsible for making commerce secure for everyone. First and foremost, that means securing our internal systems and IT resources, and maintaining a strong cybersecurity posture. If you’ve worked in these spaces before, you know that it takes a laundry list of tools to effectively manage and secure a large fleet of computers. Not only does it take tons of tools, it also takes training, access provisioning and deprovisioning, and constant patching. In any large or growing company, these problems compound and can become exponential costs if they aren’t controlled and solved for.

You either pay that cost by spending countless human hours on menial processes and task switching, or you accept the risk of shadow IT—employees developing their own processes and workarounds rather than following best practices. You either get choked by bureaucracy, or you create such a low trust environment that people don’t feel their company is interested in solving their problems.

Shopify is a global company that, in 2020, embraced being Digital by Design—in essence, the firm belief that our people have the greatest impact when we support them to work whenever and wherever they like. As you can imagine, this only magnified the problems described above. With the end of office centricity, suddenly the work of securing our devices got a lot more important, and a lot more difficult. Network environments got more varied, the possibility of in-person patching or remediation went out the window—the list goes on. Faced with these challenges, we searched for off-the-shelf solutions, but couldn’t find anything that fully fit our needs.

Hubble logo

So, We Built Hubble.

An evolution of previous internal solutions, Hubble is a tool that encapsulates and extends many of the common tools used in security. Mobile device management services and more are all fully integrated into Hubble. For IT staff, Hubble is a one-stop shop for inventory, device management, and security. Rather than granting hundreds of employees access to multiple admin panels, they access Hubble—which ingests and standardizes data from other systems, and then sends commands back to those systems. We also specify levels of granularity in access (a specialist might have more access than an entry-level worker, for instance). On the back end, we track and audit access in one central location with a consistent set of fields—making incident response and investigation less of a rabbit hole.
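To make the encapsulation pattern concrete, here’s a minimal sketch of the idea: each upstream system gets a small normalizer, and everything merges into one standardized catalog behind a single interface. The system names, field names, and functions below are invented for illustration; this isn’t Hubble’s actual code or schema.

```python
# Hypothetical sketch of the encapsulation pattern: ingest records from
# several device-management systems, normalize them into one standard
# schema, and expose a single lookup surface. All names are invented.

STANDARD_FIELDS = ("serial", "owner", "os_version", "last_seen")

def normalize_mdm(record):
    # Each upstream system names its fields differently; map them
    # onto the standard schema.
    return {
        "serial": record["SerialNumber"],
        "owner": record["AssignedUser"],
        "os_version": record["OSVersion"],
        "last_seen": record["LastCheckIn"],
    }

def normalize_inventory(record):
    return {
        "serial": record["asset_tag"],
        "owner": record["user_email"],
        "os_version": record.get("os", "unknown"),
        "last_seen": record["seen_at"],
    }

def build_catalog(mdm_records, inventory_records):
    """Merge all sources into one catalog keyed by serial number."""
    catalog = {}
    for rec in map(normalize_mdm, mdm_records):
        catalog[rec["serial"]] = rec
    for rec in map(normalize_inventory, inventory_records):
        # Later, lower-fidelity sources fill gaps rather than
        # overwriting richer records.
        catalog.setdefault(rec["serial"], rec)
    return catalog
```

Keying the catalog by serial number and using `setdefault` means a secondary source can add devices the primary one missed without clobbering existing data—one illustrative way to get a consistent set of fields out of inconsistent systems.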

Hubble’s status screen on a user’s machine

For everyone else at Shopify, Hubble is a tool to manage and view the devices that belong to them. At a glance, they can review the health and compliance of their device. These aren’t arbitrary metrics, but ones we define and find valuable: OS/patch compliance, VPN usage, and more. Folks don’t need to ask IT or just wonder if their device is secure. Hubble informs them, either via the website or device notification pings. And if their device isn’t secure, Hubble provides them with actionable information on how to fix it. Users can also designate test devices, or opt in to betas that we run. This enables us to easily build beta cohorts for any testing we might be running. When you give people the tools to be proactive about their security, and show that you support that proactivity, you help build a culture of ownership.

And, perhaps most importantly, Hubble is a single source of truth for all the data it consumes. This makes it easier for other teams to develop automations and security processes. They don’t have to worry about standardizing data, or making calls to 100 different services. They can access Hubble, and trust that the data is reliable and standardized.

Now, why should you care about this? Hubble is an internal tool for Shopify, and unfortunately it isn’t open source at this time. But these two lessons we learned building and realizing Hubble are valuable and applicable anywhere.

1. When the conversation is centered on encapsulation, the result is a partnership in creating a thoughtful and comprehensive solution.

Building and maintaining Hubble requires a lot of teams talking to each other. Developers talk to support staff, security engineers, and compliance managers. While these folks often work near each other, they rarely work directly together. This kind of collaboration is super valuable and can help you identify a lot of opportunities for automation and development. Plus, it presents the opportunity for team members to expand their skills, and maybe have an idea of what their next role could be. Even if you don’t plan to build a tool like this, consider involving frontline staff with the design and engineering processes in your organization. They bring valuable context to the table, and can help surface the real problems that your organization faces.

2. It’s worth fighting for investment.

IT and Cybersecurity are often reactive, ad-hoc-driven teams. In the worst cases, this field lends itself to unhealthy cultures and an erratic work-life balance. Incident response teams and frontline support staff often have unmanageable workloads and expectations, in large part due to outdated tooling and processes. We strive to make sure it isn’t like that at Shopify, and it doesn’t have to be that way where you work. We’ve been able to use Hubble as a platform for identifying automation opportunities. By having engineering teams connected to support staff via Hubble, we encourage a culture of proactivity. Teams don’t just accept processes as broken and outdated—they know that there’s talent and resources available for solving problems and making things better. Beyond culture and work-life balance, consider the financial benefits and risk minimization that this strategy realizes.

For each new employee onboarded to your IT or Cybersecurity teams, you spend weeks if not months helping them ramp up and safely access systems. This can incur certification and training costs (which can easily run to thousands of dollars per employee if you pay for their certifications), on top of a more difficult search for the right candidate in the first place. Then you take on the risk of all these people having direct access to sensitive systems. And finally, you take on the audit and tracking burden of all of this.

With each tool you add to your environment, you increase complexity exponentially. But there’s a reason those tools exist, and complexity on its own isn’t a good enough reason to reject a tool. This is a field where costs want to grow exponentially. It seems like the default is to either accept that cost and the administrative overhead it brings, or ignore the cost and just eat the risk. It doesn’t have to be that way.

We chose to invest and to build Hubble to solve these problems at Shopify. Encapsulation can keep you secure while keeping everyone sane at the same time.

Tony is a Senior Engineering Program Manager and leads a team focussed on automation and internal support technologies. He’s journaled daily for more than 9 years, and uses it as a fun corpus for natural language analysis. He likes finding old bread recipes and seeing how baking has evolved over time!



How to Structure Your Data Team for Maximum Influence


One of the biggest challenges most managers face (in any industry) is trying to assign their reports work in an efficient and effective way. But as data science leaders—especially those in an embedded model—we’re often faced with managing teams with responsibilities that traverse multiple areas of a business. This juggling act often involves different streams of work, areas of specialization, and stakeholders. For instance, my team serves five product areas, plus two business areas. Without a strategy for dealing with these stakeholders and related areas of work, we risk operational inefficiency and chaotic outcomes. 

There are many frameworks out there that suggest the most optimal way to structure a team for success. Below, we’ll review these frameworks and their positives and negatives when applied to a data science team. We’ll also share the framework that’s worked best for empowering our data science teams to drive impact.

An example of the number of product and business areas my data team supports at Shopify

First, Some Guiding Principles

Before looking at frameworks for managing these complex team structures, I’ll first describe some effective guiding principles we should use when organizing workflows and teams:  

  1. Efficiency: Any structure must provide an ability to get work done in an efficient and effective manner. 
  2. Influence: Structures must be created in such a way that your data science team continues to have influence on business and product strategies. Data scientists often have input that is critical to business and product success, and we want to create an environment where that input can be given and received.
  3. Stakeholder clarity: We need to create a structure where stakeholders clearly know whom to contact to get work done, and whom to approach for help and advice.
  4. Stability: Some team structures can create instability for reports, which leads to a whole host of other problems.
  5. Growth: If we create structures where reports only deal with stakeholders and reactive issues, it may be difficult for them to develop professionally. We want to ensure reports have time to tackle work that enables them to acquire a depth of knowledge in specific areas.
  6. Flexibility: Life happens. People quit, need change, or move on. Our team structures need to be able to deal with and recognize that change is inevitable. 

Traditional Frameworks for Organizing Data Teams

Alright, now let’s look at some of the more popular frameworks used to organize data teams. While they’re not the only ways to structure teams and align work, these frameworks cover most of the major aspects in organizational strategy. 

Swim Lanes

You’ve likely heard of this framework before, and maybe even cringed when someone has told you or your report to "stay in your swim lanes". This framework involves assigning someone to very strictly defined areas of responsibility. Looking at the product and business areas my own team supports as an example, we have seven different groups to support. According to the swim lane framework, I would assign one data scientist to each group. With an assigned product or business group, their work would never cross lanes. 

In this framework, there’s little expected help or cross-training, and everyone operates within their own fiefdom. I once worked in an environment like this. We were a group of tenured data scientists who didn’t really know what the others were doing. It worked for a while, but when change occurred (new projects, resignations, retirements) it all seemed to fall apart.

Let’s look at this framework’s advantages: 

  • Distinct areas of responsibility. In this framework, everyone has their own area of responsibility. As a manager, I know exactly who to assign work to and where certain tasks should go on our board. I can be somewhat removed from the process of workload balancing.  
  • High levels of individual ownership. Reports own an area of responsibility and have a stake in its success. They also know that their reputation and job are on the line for the success or failure of that area.
  • The point-of-contact is obvious to stakeholders. Ownership is very clear to stakeholders, so they always know who to go to. This model also fosters long-term relationships.

And the disadvantages:

  • Lack of cross-training. Individual reports will have very little knowledge of the work or codebase of their peers. This becomes an issue when life happens and we need to react to change.
  • Reports can be left on an island. Reports can be left alone which tends to matter more when times are tough. This is a problem for both new reports who are trying to onboard and learn new systems, but also for tenured reports who may suddenly endure a higher workload. Help may not be coming.  
  • Fails under high-change environments. For the reasons mentioned above, this system fails under high-change environments. It also creates a team-level rigidity that means when general organizational changes happen, it’s difficult to react and pivot.

Referring back to our guiding principles for effectively organizing a data team, this framework hits our stakeholder clarity and efficiency principles, but only in stable environments. Swim lanes often fail in conditions of change or when the team needs to pivot to new responsibilities—something most teams should expect.

Stochastic Process

As data scientists, we’re often educated in stochastic processes, and this framework resembles that theory. As a refresher, a stochastic process is characterized by randomness; here, the expected behavior is a near-random assignment of people to areas or categories.

Likewise, in this framework each report takes the next project that pops up, resembling a random assignment of work. However, projects are prioritized and when an employee finishes one project, they take on the next, highest priority project. 
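As a rough sketch of this assignment rule, a priority queue captures it well. Everything below is invented for illustration, and a simple round-robin stands in for “whoever finishes their project first”:

```python
import heapq

def assign_projects(projects, workers):
    """Assign prioritized projects in 'next finished takes next priority' order.

    projects: list of (priority, name) tuples, lower number = higher priority.
    workers: list of worker names; round-robin approximates who frees up next.
    Returns a dict of worker -> list of project names, in assignment order.
    """
    queue = list(projects)
    heapq.heapify(queue)  # min-heap: highest-priority (lowest number) first
    assignments = {w: [] for w in workers}
    i = 0
    while queue:
        worker = workers[i % len(workers)]
        _, name = heapq.heappop(queue)
        assignments[worker].append(name)
        i += 1
    return assignments
```

The point of the sketch is what it *lacks*: nothing in the assignment considers domain fit or past experience, which is exactly why the framework feels random to the people inside it.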

This may sound overly random as a system, but I’ve worked on a team like this before. We were a newly formed team, and no one had any specific experience with the work we were doing. The system worked well for about six months, but over the course of a year, we felt like we'd been put through the wringer, and no one had developed any deep knowledge of what we were working on.

The advantages of this framework are:

  • High levels of team collaboration. Everyone is constantly working on each other’s code and projects, so a high-level of collaboration tends to develop.
  • Reports feel like there is always help. Since the next available person takes the highest-priority task, if someone is struggling with a high-priority task, they can ask for help from whoever frees up next.
  • Extremely flexible under high levels of change. Your organization decides to reorg to align to new areas of the business? No problem! You weren’t aligned to any specific groups of stakeholders to begin with. Someone quits? Again, no problem. Just hire someone new and get them into the rotation.

And the disadvantages:

  • Can feel like whiplash. As reports are asked to move constantly from one unrelated project to the next, they can develop feelings of instability and uncertainty (aka whiplash). Additionally, as stakeholders work with a new resource on each project, this can limit the ability to develop rapport.
  • Inability to go deep on specialized subject matters. It’s often advantageous for data scientists to dive deep into one area of the business or product. This enables them to develop deep subject area knowledge in order to build better models. If we’re expecting them to move from project to project, this is unlikely to occur.
  • Extremely high management inputs. As data scientists become more like cogs in a wheel in this type of framework, management ends up owning most stakeholder relationships and business knowledge. This increases demands on individual managers.

Looking at the advantages and disadvantages of this framework, and measuring them against our guiding principles, this framework only hits two of our principles: flexibility and efficiency. While this framework can work in very specific circumstances (like brand new teams), the lack of stakeholder clarity, relationship building, and growth opportunity will result in the failure of this framework to sufficiently serve the needs of the team and stakeholders. 

A New Framework: Diamond Defense 

Luckily, we’ve created a third way to organize data teams and work. I like to compare this framework to the concept of diamond defense in basketball. In diamond defense, players have general areas (zones) of responsibility. However, once play starts, the defense focuses on trapping (sending extra resources) to the toughest problems, while helping out areas in the defense that might be left with fewer resources than needed.  

This same defense method can be used to structure data teams to be highly effective. In this framework, you loosely assign reports to your product or business areas, but rotate resources to tough projects and wherever help is needed.

Referring back to the product and business areas my team supports, you can see how I use this framework to organize my team: 

An example of how I use the diamond defense framework to structure my data team and align them to zones of work

Each data scientist is assigned to a zone. I then aligned our additional business areas (Finance and Marketing) to a product group, and assigned resources to these groupings. Finance and Marketing are aligned differently here because they are not supported by a team of Software Engineers. Instead, I aligned them to the product group that most closely resembles their work in terms of data accessed and models built. Currently, Marketing has the highest number of requests for our team, so I added more resources to support this group.

You’ll notice on the chart that I keep myself and an additional data scientist in a bullpen. This is key to the diamond defense as it ensures we always have additional resources to help out where needed. Let’s dive into some examples of how we may use resources in the bullpen: 

  1. DS2 is under-utilized. We simultaneously find out that DS1 is overwhelmed by the work of their product area, so we tap DS2 to help out. 
  2. SR DS1 quits. In this case, we rotate DS4 into their place, and proceed to hire a backfill. 
  3. SR DS2 takes a leave of absence. In this situation, I as the manager slide in to manage SR DS2’s stakeholders. I would then tap DS4 to help out, while the intern who is also assigned to the same area continues to focus on getting their work done with help from DS4. 
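The scenarios above all follow one underlying rule: send the next free bullpen member wherever the load is heaviest. Here’s a toy sketch of that rule (all names and numbers are invented, and real staffing decisions obviously weigh far more than a load figure):

```python
def dispatch_bullpen(zone_loads, bullpen, capacity_per_person=1):
    """Send bullpen members to the busiest zones, one at a time.

    zone_loads: dict of zone -> outstanding work units (illustrative only).
    bullpen: list of available helpers (e.g. a manager, an unassigned DS).
    Returns a dict of bullpen member -> zone they should go help.
    """
    loads = dict(zone_loads)  # copy so the caller's dict isn't mutated
    helping = {}
    for person in bullpen:
        # Greedy rule: the next free person goes to the busiest zone.
        busiest = max(loads, key=loads.get)
        if loads[busiest] <= 0:
            break  # no zone needs help; keep the rest in the bullpen
        helping[person] = busiest
        loads[busiest] -= capacity_per_person
    return helping
```

Because the dispatch re-checks the loads after each assignment, two helpers can land in the same zone when it’s far busier than the rest, which mirrors the “trapping” idea from the basketball analogy.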

This framework has several advantages:

  • Everyone has dedicated areas to cover and specialize in. As each report is loosely assigned to a zone (specific product or business area), they can go deep and develop specialized skills.  
  • Able to quickly jump on problems that pop up. Loose assignment to zones enable teams the flexibility to move resources to the highest-priority areas or toughest problems.  
  • Reports can get the help they need. If a report is struggling with the workload, you can immediately send more resources towards that person to lighten their load.  

And the disadvantages:

  • Over-rotation. In certain high-change circumstances, a situation can develop where data scientists spend most of their time covering for other people. This can create very volatile and high-risk situations, including turnover.

This framework hits all of our guiding principles. It provides the flexibility and stability needed when dealing with change, it enables teams to efficiently tackle problems, focus areas enable report growth and stakeholder clarity, and relationships between reports and their stakeholders improves the team's ability to influence policies and outcomes. 


There are many ways to organize data teams around different business or product areas, stakeholders, and bodies of work. While the traditional frameworks we discussed above can work in the short term, they tend to over-focus either on rigid areas of responsibility or on everyone being able to take on any project.

If you use one of these frameworks and you’re noticing that your team isn’t working as effectively as you know they can, give our diamond defense framework a try. This hybrid framework addresses all the gaps of the traditional frameworks, and ensures:

  • Reports have focus areas and growth opportunity 
  • Stakeholders have clarity on who to go to
  • Resources are available to handle any change
  • Your data team is set up for long-term success and impact

Every business and team is different, so we encourage you to play around with this framework and identify how you can make it work for your team. Just remember to reference our guiding principles for complex team structures.

Levi manages the Banking and Accounting data team at Shopify. He enjoys finding elegant solutions to real-world business problems using math, machine learning, and elegant data models. In his spare time he enjoys running, spending time with his wife and daughters, and farming. Levi can be reached via LinkedIn.

Are you passionate about solving data problems and eager to learn more about Shopify? Check out openings on our careers page.

Finding Relationships Between Ruby’s Top 100 Packages and Their Dependencies

In June of this year, RubyGems, the main repository for Ruby packages (gems), announced that multi-factor authentication (MFA) was going to be gradually rolled out to users. This means that users will eventually need to log in with a one-time password from their authenticator device, which will drastically reduce account takeovers.

The team I'm interning on, the Ruby Dependency Security team at Shopify, played a big part in rolling out MFA to RubyGems users. The team’s mission is to increase the security of the Ruby software supply chain, so increasing MFA usage is something we wanted to help implement.

A large Ruby with stick arms and legs pats a little Ruby with stick arms and legs
Illustration by Kevin Lin

One interesting decision the RubyGems team faced was determining who would be included in the first milestone. The team wanted to include at least the top 100 RubyGems packages, but also wanted to prevent packages (and people) from falling out of this cohort in the future.

To meet those criteria, the team set a threshold of 180 million downloads for the gems instead. Once a gem crosses 180 million downloads, its owners are required to use multi-factor authentication in the future.

Bar graph showing gem download numbers for Gem 1 and Gem 2
Gem downloads represented as bars. Gem 2 is over the 180M download threshold, so its owners would need MFA.

This design decision led me to a curiosity. As packages frequently depend on other packages, could some of these big (more than 180M downloads) packages depend on small (less than 180M downloads) packages? If this was the case, then there would be a small loophole: if a hacker wanted to maximize their reach in the Ruby ecosystem, they could target one of these small packages (which would get installed every time someone installed one of the big packages), circumventing the MFA protection of the big packages.

On the surface, it might not make sense that a dependency would ever have fewer downloads than its parent. After all, every time the parent gets downloaded, the dependency does too, so surely the dependency has at least as many downloads as the parent, right?

Screenshot of a Slack conversation between coworkers discussing one's scepticism about finding exceptions
My coworker Jacques, doubting that big gems will rely on small gems. He tells me he finds this hilarious in retrospect.

Well, I thought I should try to find exceptions anyway, and given that this blog post exists, it would seem that I found some. Here’s how I did it.

The Investigation

The first step in determining if big packages depended on small packages was to get a list of big packages. The RubyGems.org stats page shows the top 100 gems in terms of downloads, but the last gem on page 10 has 199 million downloads, meaning that scraping these pages would yield an incomplete list, since the threshold I was interested in is 180 million downloads.

A screenshot of a page of statistics
Page 10 of the RubyGems.org stats page, just a bit above the MFA download threshold

To get a complete list, I instead turned to the data dumps that RubyGems.org makes available. Basically, the site takes a daily snapshot of the database, removes any confidential information, and then publishes it. Their repo has a convenient script that allows you to load these data dumps into your own local database and run queries on the data using the Rails console. It took me many tries to write a query that got all the big packages, but I eventually found one that worked:

Rubygem.joins(:gem_download).where(gem_download: {count: 180_000_000..}).map(&:name)

I now had a list of 112 big gems, and I had to find their dependencies. The first method I tried was the RubyGems.org API. As described in the documentation, you can give the API the name of a gem and it’ll give you the names of all of its dependencies as part of the response payload. The same endpoint also tells you how many downloads a gem has, so the path was clear: for each big gem, get a list of its dependencies and find out if any of them had fewer downloads than the threshold.

Here are the functions that get the dependencies and downloads:

Ruby function that gets a list of dependencies as reported by the API. Requires built-in uri, net/http, and json packages.
Ruby function that gets downloads from the same API endpoint. It also has a branch to check the download count for specific versions of gems, which I later used.
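The code in those screenshots isn’t searchable text, so here’s a minimal sketch of what such helpers might look like. It assumes the public RubyGems.org v1 gems endpoint; the parsing is split into pure helpers so the logic can be exercised without network access, and the names are mine, not the original function names.

```ruby
require "uri"
require "net/http"
require "json"

# Fetch a gem's metadata from the RubyGems.org v1 API.
def gem_info(name)
  uri = URI("https://rubygems.org/api/v1/gems/#{name}.json")
  JSON.parse(Net::HTTP.get(uri))
end

# Names of the direct runtime dependencies in an API payload.
def runtime_dependencies(info)
  info["dependencies"]["runtime"].map { |dep| dep["name"] }
end

# Total download count reported in an API payload.
def download_count(info)
  info["downloads"]
end

# Usage sketch: flag any direct dependency under the 180M threshold.
# small = runtime_dependencies(gem_info("nokogiri"))
#           .select { |dep| download_count(gem_info(dep)) < 180_000_000 }
```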

Putting all of this together, I found that 13 out of the 112 big gems had small gems as dependencies. Exceptions! So why did these small gems have fewer downloads than their parents? I learned that it was mainly due to two reasons:

  1. Some gems are newer than their parents, that is, a new gem came out and a big gem developer wanted to add it as a dependency.
  2. Some gems are shipped with Ruby by default, so they don’t need to be downloaded and thus have low(er) download count (for example, racc and rexml).

With this, I now had proof of the existence of big gems that would be indirectly vulnerable to account takeover of a small gem. While an existence proof is nice, it was pointed out to me that the API only returns the direct dependencies of a gem, and that those dependencies might have sub-dependencies that I wasn’t checking. So how could I find out which packages get installed when one of these big gems gets installed?

With Bundler, of course!

Bundler is the Ruby dependency manager software that most Ruby users are probably familiar with. Bundler takes a list of gems to install (the Gemfile), installs dependencies that satisfy all version requirements, and, crucially for us, makes a list of all those dependencies and versions in a Gemfile.lock file. So, to find out which big gems relied in any way on small gems, I programmatically created a Gemfile with only the big gem in it, programmatically ran bundle lock, and programmatically read the Gemfile.lock that was created to get all the dependencies.

Here’s the function that did all the work with Bundler:

Ruby function that gets all dependencies that get installed when one gem is installed using Bundler
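The screenshot above isn’t searchable text, so here’s a rough sketch of the approach it describes: write a one-gem Gemfile into a temporary directory, run bundle lock there, and read back the generated Gemfile.lock. This assumes `bundle` is on the PATH; the lockfile parsing is pulled into its own helper (top-level entries in the `specs:` section are indented four spaces, their sub-dependencies six) so it can be tested without network access, and the names are mine.

```ruby
require "tmpdir"

# Top-level gem names in the `specs:` section of a Gemfile.lock.
# Entries like "    racc (1.6.0)" are indented four spaces; their own
# dependency lines are indented six, so they don't match this pattern.
def lockfile_gem_names(lockfile_text)
  lockfile_text.scan(/^    (\S+) \(/).flatten.uniq
end

# Everything Bundler would install alongside `name`.
def all_installed_with(name)
  Dir.mktmpdir do |dir|
    File.write(File.join(dir, "Gemfile"), <<~GEMFILE)
      source "https://rubygems.org"
      gem "#{name}"
    GEMFILE
    system("bundle lock", chdir: dir, exception: true)
    lockfile_gem_names(File.read(File.join(dir, "Gemfile.lock"))) - [name]
  end
end
```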

With this new methodology, I found that 24 of the 112 big gems rely on small gems, which is a fairly significant proportion of them. After discovering this, I wanted to look into visualization. Up until this point, I was just printing out results to the command line to make text dumps like this:

Text dump of dependency results. Big gems are red, their dependencies that are small are indented in black

This visualization isn’t very convenient to read, and it misses out on patterns. For example, as you can see above, many big gems rely on racc. It would be useful to know if they relied directly on it, or if most packages depended on it indirectly through some other package. The idea of making a graph had been in the back of my mind since the beginning of this project, and when I realized how helpful it might be, I committed to it. I used the graph gem, following some examples from this talk by Aja Hammerly. I used a breadth-first search, starting with a queue of all the big gems and adding direct dependencies to the queue as I went. I added edges from gems to their dependencies and highlighted small gems in red. Here was the first iteration:
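As a sketch, that breadth-first pass can look like the following. The dependency lookup is passed in as a callable so the traversal itself can be exercised without network access; the names are mine, not from the original code.

```ruby
# Breadth-first traversal from the big gems, collecting graph edges.
# `deps_of` is any callable mapping a gem name to its direct dependencies.
def dependency_edges(big_gems, deps_of)
  queue = big_gems.dup
  visited = {}
  edges = []
  until queue.empty?
    gem_name = queue.shift
    next if visited[gem_name]
    visited[gem_name] = true
    deps_of.call(gem_name).each do |dep|
      edges << [gem_name, dep]
      queue << dep # direct dependencies join the queue for later expansion
    end
  end
  edges
end
```

Each collected edge would then be handed to the graph gem for rendering, with small gems drawn in red.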

The output of the graph gem that highlights gem dependencies
The first iteration

It turns out there are a lot of AWS gems, so I decided to remove them from the graph and got a much nicer result:

The output of the graph gem that highlights gem dependencies
Full size image link if you want to zoom and pan

The graph, while moderately cluttered, shows a lot of information succinctly. For instance, you can see a galaxy of gems in the middle-left, with rails being the gravitational attractor, a clear keystone in the Ruby world.

Output of the gem graph with Rails at the center
The Rails galaxy

The node with the most arrows pointing into it is activesupport, so it really is an active support.

A close up view of activesupport in the output of the gem graph. activesupport has many arrows pointing into it.
14 arrows pointing into activesupport

Racc, despite appearing in my printouts as a small gem for many big gems, is only a direct dependency of nokogiri.

A close up view of racc in the output of the gems graph
racc only has 1 edge attached to it

With this nice graph created, I followed up and made one final printout. This time, whenever I found a big gem that depended on a small gem, I printed out all the paths on the graph from the big gem to the small gem, that is, all the ways that the big gem relied on the small gem.

Here’s an example printout:

Big gem is in green (googleauth), small gems are in purple, and the black lines are all the paths from the big gem to the small gem.

I achieved this by making a directed graph data type and writing a depth-first search algorithm to find all the paths from one node to another. I chose to create my own data type because, as far as I could tell, finding all paths on a graph isn’t already implemented in any Ruby gem. Here’s the algorithm, if you’re interested (`@graph` is a Hash of `String:Array` pairs, essentially an adjacency list):

Recursive depth-first search to find all paths from start to end
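Since the screenshot isn’t searchable, here’s one way such a search can be sketched over a plain adjacency list (my own names and structure; the original implementation may differ):

```ruby
# Recursive depth-first search returning every path from `start` to
# `target` in an adjacency list (Hash of String => Array of String).
def all_paths(graph, start, target, path = [])
  path += [start]
  return [path] if start == target
  (graph[start] || []).flat_map do |neighbor|
    next [] if path.include?(neighbor) # skip cycles
    all_paths(graph, neighbor, target, path)
  end
end
```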

What’s Next

In summary, I found four ways to answer the question of whether or not big gems rely on small gems:

  1. direct dependency printout (using API)
  2. sub-dependency printout (using Bundler)
  3. graph (using graph gem)
  4. sub-dependency printout with paths (like 2., but using my own graph data type).

I’m happy with my work, and I’m glad I got to learn about file I/O and use graph theory. I’m still relatively new to Ruby, so offshoot projects like these are great learning opportunities.

The question remains of what to do with the 24 technically insecure gems. One proposal is to do nothing, since everyone will eventually need to have MFA enabled, and account takeover is still an uncommon event despite being on the rise. 

Another option is to enforce MFA on these specific gems as a sort of blocklist, just to ensure the security of the top gems sooner. This would mean a small group of owners would have to enable MFA a few months earlier, so I could see this being a viable option. 

Either way, more discussion with my team is needed. Thanks for reading!

Kevin is an intern on the Ruby Dependency Security team at Shopify. He is in his 5th year of Engineering Physics at the University of British Columbia.

Wherever you are, your next journey starts here! If building systems from the ground up to solve real-world problems interests you, our Engineering blog has stories about other challenges we have encountered. Intrigued? Visit our Engineering career page to find out about our open positions and learn about Digital by Design.

How to Write Code Without Having to Read It

Do we need to read code before editing it?

The idea isn’t as wild as it sounds. In order to safely fix a bug or update a feature, we may need to learn some things about the code. However, we’d prefer to learn only that information. Not only does extra reading waste time, it overcomplicates our mental model. As our model grows, we’re more likely to get confused and lose track of critical info.

But can we really get away with reading nothing? Spoiler: no. However, we can get closer by skipping over areas that we know the computer is checking, saving our focus for areas that are susceptible to human error. In doing so, we’ll learn how to identify and eliminate those danger areas, so the next person can get away with reading even less.

Let’s give it a try.

Find the Entrypoint

If we’re refactoring code, we already know where we need to edit. Otherwise, we’re changing a behavior that has side effects. In a backend context, these behaviors would usually be exposed APIs. On the frontend, this would usually be something that’s displayed on the screen. For the sake of example, we’ll imagine a mobile application using React Native and TypeScript, but this process generalizes to other contexts (as long as they have some concept of build or test errors; more on this later).

If our goal was to read a lot of code, we might search for all hits on RelevantFeatureName. But we don’t want to do that. Even if we weren’t trying to minimize reading, we’ll run into problems if the code we need to modify is called AlternateFeatureName, SubfeatureName, or LegacyFeatureNameNoOneRemembersAnymore.

Instead, we’ll look for something external: the user-visible strings (including accessibility labels—we did remember to add those, right?) on the screen we’re interested in. We search various combinations of string fragments, quotation marks, and UI inspectors until we find the matching string, either in the application code or in a language localization file. If we’re in a localization file, the localization key leads us to the application code that we’re interested in.

If we’re dealing with a regression, there’s an easier option: git bisect. When git bisect works, we really don’t need to read the code. In fact, we can skip most of the following steps. Because this is such a dramatic shortcut, always keep track of which bugs are regressions from previously working code.

Make the First Edit

If we’ve come in to make a simple copy edit, we’re done. If not, we’re looking for a component that ultimately gets populated by the server, disk, or user. We can no longer use exact strings, but we do have several read-minimizing strategies for zeroing in on the component:

  1. Where is this component on the screen, relative to our known piece of text?
  2. What type of standard component is it using? Is it a button? Text input? Text?
  3. Does it have some unusual style parameter that’s easy to search? Color? Corner radius? Shadow?
  4. Which button launches this UI? Does the button have searchable user-facing text?

These strategies all work regardless of naming conventions and code structure. Previous developers would have a hard time making our life harder without making the code nonfunctional. However, they may be able to make our life easier with better structure. 

For example, if we’re using strategy #1, well-abstracted code helps us quickly rule out large areas of the screen. If we’re looking for some text near the bottom of the screen, it’s much easier to hit the right Text item if we can leverage a grouping like this:

<SomeHeader />
<SomeContent />
<SomeFooter />

rather than being stuck searching through something like this:

// Header
<StaticImage />
<Text />
<Text />
<Button />
<Text />
// Content
// Footer

where we’ll have to step over many irrelevant hits.

Abstraction helps even if the previous developer chose wacky names for header, content, or footer, because we only care about the broad order of elements on the screen. We’re not really reading the code. We’re looking for objective cues like positioning. If we’re still unsure, we can comment out chunks of the screen, starting with larger or highly-abstracted components first, until the specific item we care about disappears.

Once we’ve found the exact component that needs to behave differently, we can make the breaking change right now, as if we’ve already finished updating the code. For example, if we’re making a new component that displays data newText, we add that parameter to its parent’s input arguments, breaking the build.

If we’re fixing a bug, we can also start by adjusting an argument list. For example, the condition “we shouldn’t be displaying x if y is present” could be represented with the tagged union {mode: 'x'; x: XType} | {mode: 'y'; y: YType}, so it’s physically impossible to pass in x and y at the same time. This will also trigger some build errors.

Tagged Unions
Tagged unions go by a variety of different names and syntaxes depending on language. They’re most commonly referred to as discriminated unions, enums with associated values, or sum types.

Climb Up the Callstack

We now go up the callstack until the build errors go away. At each stage, we edit the caller as if we’ll get the right input, triggering the next round of build errors. Notice that we’re still not reading the code here—we’re reading the build errors. Unless a previous developer has done something that breaks the chain of build errors (for example, accepting any instead of a strict type), their choices don’t have any effect on us.

Once we get to the top of the chain, we adjust the business logic to grab newText or modify the conditional that was incorrectly sending x. At this point, we might be done. But often, our change could or should affect the behavior of other features that we may not have thought about. We need to sweep back down through the callstack to apply any remaining adjustments. 

On the downswing, previous developers’ choices start to matter. In the worst case, we’ll need to comb through the code ourselves, hoping that we catch all the related areas. But if the existing code is well structured, we’ll have contextual recommendations guiding us along the way: “because you changed this code, you might also like…”

Update Recommended Code

As we begin the downswing, our first line of defense is the linter. If we’ve used a deprecated library, or a pattern that creates non-obvious edge cases, the linter may be able to flag it for us. If previous developers forgot to update the linter, we’ll have to figure this out manually. Are other areas in the codebase calling the same library? Is this pattern discouraged in documentation?

After the linter, we may get additional build errors. Maybe we changed a function to return a new type, and now some other consumer of that output raises a type error. We can then update that other consumer’s logic as needed. If we added more cases to an enum, perhaps we get errors from other exhaustive switches that use the enum, reminding us that we may need to add handling for the new case. All this depends on how much the previous developers leaned on the type system. If they didn’t, we’ll have to find these related sites manually. One trick is to temporarily change the types we’re emitting, so all consumers of our output will error out, and we can check if they need updates.

An exhaustive switch statement handles every possible enum case. Most environments don’t enforce exhaustiveness out of the box. For example, in TypeScript, we need to have strictNullChecks turned on and ensure that the switch statement has a defined return type. Once exhaustiveness is enforced, we can remove default cases, so we’ll get notified (with build errors) whenever the enum changes, reminding us that we need to reassess this switch statement.

Our final wave of recommendations comes from unit test failures. At this point, we may also run into UI and integration tests. These involve a lot more reading than we’d prefer; since these tests require heavy mocking, much of the text is just noise. Also, they often fail for unimportant reasons, like timing issues and incomplete mocks. On the other hand, unit tests sometimes get a bad rap for requiring code restructures, usually into more or smaller abstraction layers. At first glance, it can seem like they make the application code more complex. But we didn’t need to read the application code at all! For us, it’s best if previous developers optimized for simple, easy-to-interpret unit tests. If they didn’t, we’ll have to find these issues manually. One strategy is to check git blame on the lines we changed. Maybe the commit message, ticket, or pull request text will explain why the feature was previously written that way, and any regressions we might cause if we change it.

At no point in this process are comments useful to us. We may have passed some on the upswing, noting them down to address later. Any comments that are supposed to flag problems on the downswing are totally invisible—we aren’t guaranteed to find those areas unless they’re already flagged by an error or test failure. And whether we found comments on the upswing or through manual checking, they could be stale. We can’t know if they’re still valid without reading the code underneath them. If something is important enough to be protected with a comment, it should be protected with unit tests, build errors, or lint errors instead. That way it gets noticed regardless of how attentive future readers are, and it’s better protected against staleness. This approach also saves mental bandwidth when people are touching nearby code. Unlike standard comments, test assertions only pop when the code they’re explaining has changed. When they’re not needed, they stay out of the way.

Clean Up

Having mostly skipped the reading phase, we now have plenty of time to polish up our code. This is also an opportunity to revisit areas that gave us trouble on the downswing. If we had to read through any code manually, now’s the time to fix that for future (non)readers.

Update the Linter

If we need to enforce a standard practice, such as using a specific library or a shared pattern, codify it in the linter so future developers don’t have to find it themselves. This can trigger larger-scale refactors, so it may be worth spinning off into a separate changeset.

Lean on the Type System

Wherever practical, we turn primitive types (bools, numbers, and strings) into custom types, so future developers know which methods will give them valid outputs to feed into a given input. A primitive like timeInMilliseconds: number is more vulnerable to mistakes than time: MillisecondsType, which will raise a build error if it receives a value in SecondsType. When using enums, we enforce exhaustive switches, so a build error will appear any time a new case may need to be handled.

We also check methods for any non-independent arguments:

  • Argument A must always be null if Argument B is non-null, and vice versa (for example, error/response).
  • If Argument A is passed in, Argument B must also be passed in (for example, eventId/eventTimestamp).
  • If Flag A is off, Flag B can’t possibly be on (for example, visible/highlighted).

If these arguments are kept separate, future developers will need to think about whether they’re passing in a valid combination of arguments. Instead, we combine them, so the type system will only allow valid combinations:

  • If one argument must be null when the other is non-null, combine them into a tagged union: {type: 'failure'; error: ErrorType} | {type: 'success'; response: ResponseType}.
  • If two arguments must be passed in together, nest them into a single object: event: {id: IDType; timestamp: TimestampType}.
  • If two flags don’t vary independently, combine them into a single enum: 'hidden'|'visible'|'highlighted'.

Optimize for Simple Unit Tests

When testing, avoid entanglement with UI, disk or database access, the network, async code, current date and time, or shared state. All of these factors produce or consume side effects, clogging up the tests with setup and teardown. Not only does this spike the rate of false positives, it forces future developers to learn lots of context in order to interpret a real failure.

Instead, we want to structure our code so that we can write simple tests. As we saw, people can often skip reading our application code. When test failures appear, they have to interact with them. If they can understand the failure quickly, they’re more likely to pay attention to it, rather than adjusting the failing assertion and moving on. If a test is starting to get complicated, go back to the application code and break it into smaller pieces. Move any what code (code that decides which side effects should happen) into pure functions, separate from the how code (code that actually performs the side effects). Once we’re done, the how code won’t contain any nontrivial logic, and the what code can be tested—and therefore documented—without complex mocks.

Trivial vs. Nontrivial Logic
Trivial logic would be something like if (shouldShow) show(). Something like if (newUser) show() is nontrivial (business) logic, because it’s specific to our application or feature. We can’t be sure it’s correct unless we already know the expected behavior.

Whenever we feel an urge to write a comment, that’s a signal to add more tests. Split the logic out into its own unit tested function so the “comment” will appear automatically, regardless of how carefully the next developer is reading our code.

We can also add UI and integration tests, if desired. However, be cautious of the impulse to replace unit tests with other kinds of tests. That usually means our code requires too much reading. If we can’t figure out a way to run our code without lengthy setup or mocks, humans will need to do a similar amount of mental setup to run our code in their heads. Rather than avoiding unit tests, we need to chunk our code into smaller pieces until the unit tests become easy.


Once we’ve finished polishing our code, we manually test it for any issues. This may seem late, but we’ve converted many runtime bugs into lint, build, and test errors. Surprisingly often, we’ll find that we’ve already handled all the edge cases, even if we’re running the code for the first time.

If not, we can do a couple more passes to address the lingering issues… adjusting the code for better “unread”-ability as we go.

Sometimes, our end goal really is to read the code. For example, we might be reviewing someone else’s code, verifying the current behavior, or ruling out bugs. We can still pose our questions as writes:

  • Could a developer have done this accidentally, or does the linter block it when we try?
  • Is it possible to pass this bad combination of arguments, or would that be rejected at build time?
  • If we hardcode this value, which features (represented by unit tests) would stop working?

JM Neri is a senior mobile developer on the Shop Pay team, working out of Colorado. When not busy writing unit tests or adapting components for larger text sizes, JM is usually playing in or planning for a TTRPG campaign.


A Flexible Framework for Effective Pair Programming

Pair programming is one of the most important tools we use while mentoring early talent in the Dev Degree program. It’s an agile software development technique where two people work together, either to share context, solve a problem, or learn from one another. Pairing builds technical and communication skills, encourages curiosity and creative problem-solving, and brings people closer together as teammates.

In my role as a Technical Educator, I’m focused on setting up new interns joining the Dev Degree program for success in their first eight months at Shopify. Because pair programming is a method we use so frequently in onboarding, I saw an opportunity to streamline the process to make it more approachable for people who might not have experienced it before. I developed this framework during a live workshop I hosted at RenderATL. I hope it helps you structure your next pair programming session!

Pair Programming in Dev Degree

“Pair programming was my favorite weekly activity while on the Training Path. When I first joined my team, I was constantly pairing with my teammates. My experiences on the Training Path made these new situations manageable and incredibly fun! It was also a great way to socialize and work technically with peers outside of my school cohort that I wouldn't talk to often. I've made some good friends just working on a little project for the weekly module.” - Mikail Karimi, Dev Degree Intern

In the Dev Degree program, we use mentorship to build up the interns’ technical and soft skills. As interns are usually early in their career journey, having recently graduated from high school or switched careers, their needs differ from someone with more career experience. Mentorship from experienced developers is crucial to prepare interns for their first development team placement. It shortens the time it takes them to start making a positive impact with their work by developing technical and soft skills like problem solving and communication. This is especially important now that Shopify is digital by design, as learning and working remotely is a completely new experience for many interns.

All first-year Dev Degree interns at Shopify go through an eight-month period known as the Training Path. During this period, we deliver a set of courses designed to teach them all about Shopify-specific development. They’re mentored by Student Success Specialists, who coach them on building their soft skills like communication, and Technical Instructors, who focus on the technical aspects of the training. Pair programming with a peer or mentor is a great way to support both of these areas of development.

Each week, we allocate two to three hours for interns to pair program with each other on a problem. We don’t expect them to solve the problem completely, but they should use the concepts they learned from the week to hone their technical craft.

We also set up bi-weekly 30-minute pair programming sessions with each intern. The purpose of these sessions is to provide dedicated one-on-one time to learn and work directly with an instructor. Interns can share what they’re having trouble with, and we help them work through it.

“When I’m switching teams and disciplines, pair programming with my new team is extremely helpful to see what resources people use to debug, the internal tools they use to find information and how they approach a problem. On my current placement, I got better at resolving problems independently when I saw how my mentor handled a new problem neither of us had seen.” Sanaa Syed, Dev Degree Intern

As we scale up the program, there are some important questions I keep returning to:

  • How do we track their progress most effectively?
  • How do we know what they want to pair on each day?
  • How can we provide a safe space?
  • What are some best practices for communicating?

I started working on a framework to help answer these questions, and I know I’m not the only one on my team asking them. Along the way, an opportunity arose to run a workshop at RenderATL. At Shopify, we’re encouraged to learn as part of our professional development, and since I wanted to level up my public speaking skills, I decided to talk about mentorship through a pair programming lens. As the framework was nearly complete, I decided to crowdsource the remaining pieces and finish it together with the RenderATL attendees.

Crowdsourcing a Framework for All

On June 1st, 2022, Shopify hosted free all-day workshops at RenderATL called Heavy Hitting React at Shopify Engineering. It contained five different workshops, covering a range of topics from specific technical skills like React Native to broader skills like communication. We received a lot of positive feedback, met many amazing folks, and made sure those who attended gained new knowledge or skills they could walk away with.

For my workshop, Let’s Pair Program a Framework Together, we pair programmed a pair programming framework. The goal was to finish the framework I was working on, built around the questions I mentioned above, by crowdsourcing the missing pieces. We had over 30 attendees, and the session was structured to be interactive: I walked the audience through the framework and gathered their suggestions on its unfinished parts. At the end, the attendees paired up and used the framework to draw a picture together.

Before the workshop, I sent a survey internally asking developers a few questions about pair programming. Here are the results:

  • 62.5% had over 10 years of programming experience
  • 78.1% had pair programmed before joining Shopify
  • 50% of respondents pair once or twice a week at Shopify

When asked “What is one important trait to have when pair programming?”, this is what Shopify developers had to say:


  • Expressing your thought process (what you’re doing, why you’re making a change, and so on)
  • Sharing context to help others build a thorough understanding
  • Using visual aids to assist with explanations
  • Being aware of energy levels
  • Not being judgemental of others
  • Being curious to learn
  • Being willing to take feedback and improve
  • Not clinging to just one opinion
  • Giving your partner time to think and formulate opinions
  • Encouraging repetition of steps and instructions, to invite questions and support learning by doing

Now, let’s walk through the crowdsourced framework we finished at RenderATL.

Note: For those who attended the workshop, the framework below is the same framework that you walked away with, but with more details and resources.

The Framework

A woman with her back to the viewer writing on a whiteboard
Pair programming can be used for more than just writing code. Try pairing on other problems and using tools like a whiteboard to work through ideas together.

This framework covers everything you need to run a successful pair programming session, including roles, structure, agenda, environment, and communication. You can pick and choose within each section to design your session based on your needs.

1a. Pair Programming Styles

There are many different ways to run a pair programming session. Here are the ones we found to be the most useful, and when you may want to use each depending on your preferences and goals.

Driver and Navigator

Think about this style like a long road trip: one person is focused on driving to get from point A to point B, while the other provides directions, looks out for future pit stops, and observes the surroundings. As driving can be taxing, it’s a good idea to switch roles frequently.

The driver is the person leading the session and typing on the keyboard, explaining their thought process as they go. The navigator, also known as the observer, watches, reviews the code as it’s written, and makes suggestions along the way, such as proposing a refactor or thinking through potential edge cases.

If you’re an experienced person pairing with an intern or junior developer, I recommend using this style only after you’ve paired together for a few sessions. They’re likely still gaining context and getting comfortable with the codebase in the first few sessions.

Tour Guide

This style is like giving someone a personal tour of a city: the experienced person drives most of the session, hence the title tour guide, while their partner observes and asks questions along the way.

I suggest using this style when working with someone new on your team. It’s a great way to give them a personal tour to how your team’s application works and share context along the way. You can also flip it, where the least experienced person is the tour guide. I like to do this with the Dev Degree interns who are a bit further into their training when I pair with them. I find it helps bring out their communication skills once they’ve started to gain some confidence in their technical abilities.


Unstructured

The unstructured style is a freestyle way to work on something together, like learning a new language or concept. Its benefit is the team building and creative solutions that can come from two people hacking away and figuring things out, which makes it useful when a junior developer or intern pairs with someone at their level. The only downside is that without a mentor overseeing them, there’s a risk of missteps or bad habits going unchecked. This can be addressed after the session by having them share their findings with a mentor for discussion.

We allocated time for the interns to pair together. This is the style the interns typically go with, figuring things out using the concepts they learned.

1b. Activities

When people think of pair programming, they often strictly think of coding. But pair programming can be effective for a range of activities. Here are some we suggest.

Code Reviews

I remember when I first started reviewing code, I wasn’t sure what exactly I was meant to be looking for. Having an experienced mentor support with code reviews helps early talent pick up context and catch things that they might not otherwise know to look for. Interns also bring a fresh perspective, which can benefit mentors as well by prompting them to unpack why they might make certain decisions.

Technical Design and Documentation

Work together to design a new feature or walk through team documents. If you put yourself in a junior developer’s shoes, what would that look like? It could be a whiteboarding session mapping out the logic for a new feature, or improving the team’s onboarding documentation. Either could be an incredibly impactful session for them: you’ll broaden their technical depth, help future teammates onboard faster, and share your expertise along the way.

Writing Test Cases Only

Imagine you’re a junior developer working on your very first task. You’ve finished writing the functionality, tested it manually, and know it works, but you haven’t written any tests for it and aren’t sure how to. This is where a pair programming session with an experienced developer is beneficial: you work together to extend test coverage, learning your team’s style for writing tests along the way.


Onboarding

Pairing is a great way to help onboard someone new to your team. It helps them ramp up quicker with your wealth of knowledge as, together, you explore the codebase, documentation, and team-specific rituals.

Hunting Bugs

Put yourselves in your users’ shoes and go on a bug hunt together. As you test functionality in production, you’ll gain context on the product and reduce the number of bugs in your application. A win-win!

2. Set an Agenda

A photo of a woman writing in a notebook on a desk with a coffee and croissant

Setting an agenda beforehand is key to making sure your session is successful.

Before the session, work with your partner to align on the style, activity, and goals for your pair programming session. This way you can hit the ground running while pairing and work together to achieve a common goal. Here are some questions you can use to set your agenda: 

  • What do you want to pair on today?
  • How do you want the session to go? You drive? I drive? Or both?
  • Where should we be by the end of the session?
  • Is there a specific skill you want to work on?
  • What’s blocking you?

3. Set the Rules of Engagement

A classroom with two desks and chairs in the foreground 
Think of your pair programming session like a classroom. How would you make sure it’s a great environment to learn?

"After years of frequent pair programming, my teammates and I have established patterns that give the impression that we are always right next to each other, which makes context sharing and learning from peers much simpler." -Olakitan Bello, Developer

Now that your agenda is set, it’s time to think about the environment you want to have during the session. Imagine yourself as a teacher. If this was a classroom, how would you provide the best learning environment?

Be Inclusive

Everyone should feel welcomed and invited to collaborate with others. One way we can set the tone is to establish that “There are no wrong answers here” or “There are no dumb questions.” If you’re a senior colleague, saying “I don’t know” to your partner is a very powerful thing. It shows that you’re a human too! Keep accessibility in mind as well. There are tools and styles available to tailor pair programming sessions to the needs of you and your partner. For example, there are alternatives to verbal communication, like using a digital whiteboard or even sending messages over a communication platform. Invite people to be open about how they work best and support each other to create the right environment.

Remember That Silence Isn’t Always Golden

If it gets very quiet as you’re pairing together, it’s usually not a good sign. When you pair program with someone, communication is very important to both parties. Without it, it’s hard for one person to perceive the other person’s thoughts and feelings. Make a habit of explaining your thought process out loud as you work. If you need a moment to gather your thoughts, simply let your partner know instead of going silent on them without explanation.

Respect Each Other

Treat your partner the way you want to be treated, and value all opinions. Someone else’s opinion can lead to an even greater solution, and everyone should be able to contribute and express theirs.

Be Empathic Not Apathetic

If you’re pair programming remotely, displaying empathy goes a long way. As you’re pair programming with them, read the room. Do they feel flustered with you driving too fast? Are you aware of their emotional needs at the moment?

As you’re working together, listen attentively and provide them space to contribute and formulate opinions.

Mistakes Are Learning Opportunities

If you made a mistake while pair programming, don’t be embarrassed about it. Mistakes happen, and are actually opportunities to learn. If you notice your partner make a mistake, point it out politely—no need to make a big deal of it.

4. Communicate Well

A room with five happy people talking in a meeting
Communication is one of the most important aspects of pair programming.

Pair programming is all about communication. For two people to work together to build something, both need to be on the same page. Remote work can introduce unique communication challenges, since you don’t have the benefit of things like body language or gestures that come with being in the same physical room. Fortunately, there’s great tooling available, like Tuple, to solve these challenges and even enhance the pair programming experience. Tuple is a macOS-only application that allows people to pair program remotely: users share their screen, and either person can take control to drive. The best part is that it’s a seamless experience without any additional UI taking up space on your screen.

During your session, use these tips to make sure you’re communicating with intention. 

Use Open-ended Questions

Open-ended questions lead to longer dialogues and give someone a moment to think critically. Even if they don’t know the answer, they’ll learn something new to take away from the session. With closed-ended questions, the answer is usually just a “yes” or a “no.” Let’s say we’re working together on building a grouped React component of buttons. Which of these sounds more inviting for a discussion?

  • Is ButtonGroup a good name for the component? (Closed-ended question)
  • What do you think of the name ButtonGroup? (Open-ended question)

Other examples of open-ended questions:

  • What are some approaches you took to solving this issue?
  • Before we try this approach, what do you think will happen?
  • What do you think this block of code is doing?

Give Positive Affirmations

Encouragement goes a long way, especially when folks are finding their footing early in their career. After all, knowing what’s gone right can be just as important as knowing what’s gone wrong. Throughout the session, pause and celebrate progress by noting when you see your partner do something well.

For example, say you and your partner are building a new endpoint that’s part of a feature your team is implementing. Instead of waiting until the feature is released, celebrate the small win.

Here are a few example messages you can give:

  • It looks like we marked everything off the agenda. Awesome session today.
  • Great work catching that error. This is why I love pair programming: we work together as a team.
  • Huge win today! The PR we worked on together is going to help not only our team, but others as well.

Communication Pitfalls to Avoid

No matter how it was intended, a rude or condescending comment made in passing can throw off the vibe of a pair programming session and quickly erode trust between partners. Remember that programming in front of someone can be a vulnerable experience, especially for someone just starting out. While it might seem obvious, we all need a reminder sometimes to be mindful of what we say. Here are some things to watch out for.

Passive-Aggressive (Or Just-Plain-Aggressive) Comments

Passive-aggressive behavior is when someone expresses anger or negative feelings in an indirect way. If you’re feeling frustrated during a session, avoid slipping into this behavior. Instead, communicate your feelings directly in a constructive way. When negative feelings are expressed in a hostile or outwardly rude way, this is aggressive behavior and should be avoided completely. 

Passive-aggressive behavior examples:

  • Giving your partner the silent treatment
  • Eye-rolling or sighing in frustration when your partner makes a mistake
  • Sarcastic comments like “Could you possibly code any slower?” 
  • Subtle digs like “Well…that’s an interesting idea.”

Aggressive behavior examples

  • “Typical intern mistake, rookie.”
  • “This is NOT how someone at your level should be working.”
  • “I thought you knew how to do this.”

Absolute Words Like Always and Never

Using an absolute word means you’re 100 percent certain, and depending on how it’s used in conversation, it can sound condescending. Programming is also a world full of nuance, so overgeneralizing solutions as right or wrong is often misleading. Instead, use these scenarios as teaching opportunities. If something is usually true, explain the scenarios where there might be exceptions. If a solution rarely works, explain when it might. Edge cases can open some really interesting and valuable conversations.

For example:

  • “You never write perfect code”
  • “I always review code”
  • “That would never work”

Use alternative terms instead:

  • For always, you can use usually
  • For never, you can use rarely

“When you join Shopify, I think it's overwhelming given the amount of resources to explore even for experienced people. Pair programming is the best way to gain context and learn. In this digital by design world, pair programming really helped me to connect with team members and gain context and learn how things work here in Shopify which helped me with faster onboarding.” -Nikita Acharya, Developer

I want to thank everyone at RenderATL who helped me finish this pair programming framework. If you’re early on in your career as a developer, pair programming is a great way to get to know your teammates and build your skills. And if you’re an experienced developer, I hope you’ll consider mentoring newer developers using pair programming. Either way, this framework should give you a starting point to give it a try.

In the pilot we’ve run so far with this framework, we’ve received positive feedback from our interns about how it allowed them to achieve their learning goals in a flexible format. We’re still experimenting and iterating with it, so we’d love to hear your feedback if you give it a try! Happy pairing!

Raymond Chung is helping to build a new generation of software developers through Shopify’s Dev Degree program. As a Technical Educator, his passion for computer science and education allows him to create bite-sized content that engages interns throughout their day-to-day. When he is not teaching, you’ll find Raymond exploring for the best bubble tea shop. You can follow Raymond on Twitter, GitHub, or LinkedIn.

Wherever you are, your next journey starts here! If building systems from the ground up to solve real-world problems interests you, our Engineering blog has stories about other challenges we have encountered. Intrigued? Visit our Engineering career page to find out about our open positions and learn about Digital by Design.

Lessons From Building Android Widgets

By Matt Bowen, James Lockhart, Cecilia Hunka, and Carlos Pereira

When the new widget announcement was made for iOS 14, our iOS team went right to work designing an experience to leverage the new platform. However, widgets aren’t new to Android and have been around for over a decade. Shopify cares deeply about its mobile experience and for as long as we’ve had the Shopify mobile app, both our Android and iOS teams ship every feature one-to-one in unison. With the spotlight now on iOS 14, this was a perfect time to revisit our offering on Android.

Since our offering was the same across both platforms, we knew, just like our iOS counterparts at the time, that merchants were using our widgets but needed more from them.

Why Widgets are Important to Shopify

Our widgets mainly focus on analytics that help merchants understand how they’re doing and gain insights to make better decisions quickly about their business. Monitoring metrics is a daily activity for a lot of our merchants, and on mobile, we have the opportunity to give merchants a faster way to access this data through widgets. They provide merchants a unique avenue to quickly get a pulse on their shops that isn’t available on the web.

Add Insights widget
Add Insights widget

After gathering feedback and continuously looking for opportunities to enhance our widget capabilities, we’re at our third iteration, and we’ll share with you the challenges we faced and how we solved them.

Why We Didn’t Use React Native

A couple of years ago, Shopify decided to go all in on React Native: new development should be done in React Native, and we’re also migrating some apps to the technology. This includes the flagship admin app, which is the companion app to the widgets.

Then why not write the widgets in React Native?

After doing some initial investigation, we quickly hit some roadblocks, like the fact that RemoteViews are the only way to create widgets, and there’s currently no official support for RemoteViews in the React Native community. This felt very akin to fitting a square peg into a round hole. Our iOS app also ran into issues supporting widgets with React Native, and we were running down the same path. Shopify believes in using the right tool for the job, and we believe native development was the right call in this case.

Building the Widgets

When building out our architecture for widgets, we wanted to create a consistent experience on both Android and iOS while preserving platform idioms where it made sense. In the sections below, we share our experience building widgets, pointing out some of the more difficult challenges we faced. Our aim is to shed some light on these less used surfaces, hopefully provide some inspiration, and save you time when it comes to implementing widgets in your own applications.

Fetching Data

Some types of widgets have data that changes less frequently (for example, reminders) and some that can be forecasted for the entire day (for example, calendar and weather). In our case, merchants need up-to-date metrics about their business, so we need to show data that’s as fresh as possible. Delays in data can cause confusion, or even worse, delay information that could change an action. Say you follow the stock market: you expect the stock app and widget data to be as up to date as possible, and if the data is multiple hours stale, you may have missed something important. For our widgets to be valuable, we need information to be fresh while being considerate of network usage.

Fetching Data in the App

Widgets can be kept up to date with relevant and timely information by using data available locally or fetching it from a server. The server fetch can be initiated by the widget itself or by the host app. In our case, since the app doesn’t need the same information the widget needs, we decided it made more sense to fetch it from the widget.

One benefit to how widgets are managed in the Android ecosystem over iOS is the flexibility. On iOS you have limited communication between the app and widget, whereas on Android there doesn’t seem to be the same restrictions. This becomes clear when we think about how we configure a widget. The widget configuration screen has access to all of the libraries and classes that our main app does. It’s no different than any other screen in the app. This is mostly true with the widget as well. We can access the resources contained in our main application, so we don’t need to duplicate any code. The only restrictions in a widget come with building views, which we’ll explore later.

When we save our configuration, we use shared preferences to persist data between the configuration screen and the widget. When a widget update is triggered, the shared preferences data for a given widget is used to build our request, and the results are displayed within the widget. We can read that data from anywhere in our app, allowing us to reuse this data in other parts of our app if desired.
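As a rough sketch of this pattern (all key names and types here are hypothetical, and a plain map stands in for Android’s SharedPreferences, which isn’t available off-device):

```kotlin
// Per-widget configuration persistence, keyed by the system-assigned widget ID.
// A MutableMap stands in for SharedPreferences; key names are hypothetical.
data class WidgetConfig(val shopId: String, val metricIds: List<String>, val dateRange: String)

class WidgetConfigStore(private val prefs: MutableMap<String, String> = mutableMapOf()) {
    // Save the configuration screen's selections when the merchant confirms them.
    fun save(widgetId: Int, config: WidgetConfig) {
        prefs["widget_${widgetId}_shop"] = config.shopId
        prefs["widget_${widgetId}_metrics"] = config.metricIds.joinToString(",")
        prefs["widget_${widgetId}_range"] = config.dateRange
    }

    // Read the configuration back when an update is triggered, to build the request.
    fun load(widgetId: Int): WidgetConfig? {
        val shop = prefs["widget_${widgetId}_shop"] ?: return null
        return WidgetConfig(
            shopId = shop,
            metricIds = prefs["widget_${widgetId}_metrics"].orEmpty().split(","),
            dateRange = prefs["widget_${widgetId}_range"].orEmpty(),
        )
    }
}
```

The same store can then be read from the configuration screen, the widget provider, or anywhere else in the app.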

Making the Widgets Antifragile

The widget architecture is built so that updates are mindful of battery usage, with updates controlled by the system. In the same way, our widgets must be mindful of saving bandwidth when fetching data over a network. While developing our second iteration, we came across a peculiar problem that was exacerbated by our specific use case. Since we need data to be fresh, we always pull new data from our backend on every update, each approximately 15 minutes apart to avoid having our widgets stop updating. What we found is that widgets call their update method, onUpdate(), more than once in an update cycle. In widgets like calendar, these extra calls come without much extra cost, as the data is stored locally. In our app, however, this triggered two to five extra network calls for the same widget, with the same data, in quick succession.

In order to correct the unnecessary roundtrips, we built a simple short-lived cache:

  1. The system asks our widget to provide new data from Reportify (Shopify’s data service).
  2. We first look in a local cache using the widgetID provided by the system.
  3. If there’s data, and it was stored less than one minute ago, we return it and avoid making a network request. We also factor configuration such as locale into the cache key, so a language change still forces a fresh fetch.
  4. Otherwise, we fetch the data as normal and store it in the cache with the current timestamp.
A flow diagram highlighting the steps of the cache
The simple short-lived cache flow
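The steps above can be sketched in plain Kotlin roughly like this (names are hypothetical; the real implementation fetches from Reportify and keys entries per widget):

```kotlin
// A short-lived, per-widget cache: serve recent data instead of refetching,
// but treat a configuration change (e.g. locale) as a cache miss.
class ShortLivedCache<T>(
    private val ttlMillis: Long = 60_000,
    private val clock: () -> Long = System::currentTimeMillis,
) {
    private data class Entry<V>(val value: V, val storedAt: Long, val cacheKey: String)
    private val entries = mutableMapOf<Int, Entry<T>>()

    // Return cached data for this widget if it's fresh and was stored under the
    // same configuration key; otherwise call fetch() and store the result.
    fun getOrFetch(widgetId: Int, cacheKey: String, fetch: () -> T): T {
        val entry = entries[widgetId]
        if (entry != null && entry.cacheKey == cacheKey && clock() - entry.storedAt < ttlMillis) {
            return entry.value
        }
        val fresh = fetch()
        entries[widgetId] = Entry(fresh, clock(), cacheKey)
        return fresh
    }
}
```

Injecting the clock keeps the one-minute expiry testable without real waiting.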

With this solution, we reduced unused network calls and system load and avoided collecting incorrect analytics.

Implementing Decoder Strategy with Dynamic Selections

We follow a similar approach as we have on iOS. We create a dynamic set of queries based on what the merchant has configured.

For each metric we have a corresponding definition implementation. This approach allows each metric the ability to have complete flexibility around what data it needs, and how it decodes the data from the response.

When Android asks us to update our widgets, we pull the merchant’s selections from our configuration object. Since each of the metric IDs has a definition, we map over them to create a dynamic set of queries.

We include an extension on our response object that binds the definitions to a decoder. Our service sends back an array of the response data corresponding to the queries made. We map over the original definitions, decoding each chunk to the expected return type.
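A hedged sketch of this definition-per-metric idea (the metric names, query strings, and response shape are all invented for illustration):

```kotlin
// Each metric carries its own query and a decoder for its slice of the response.
data class MetricDefinition(
    val id: String,
    val query: String,
    val decode: (Map<String, Any?>) -> String,
)

val definitions = listOf(
    MetricDefinition("total_sales", "totalSales(range: \$range)") { chunk ->
        "$" + chunk["amount"]
    },
    MetricDefinition("order_count", "orderCount(range: \$range)") { chunk ->
        chunk["count"].toString() + " orders"
    },
)

// Build the dynamic query set from the merchant's configured metric IDs...
fun queriesFor(selectedIds: List<String>): List<String> =
    selectedIds.mapNotNull { id -> definitions.find { it.id == id }?.query }

// ...and decode each response chunk with its matching definition, in order.
fun decodeAll(selectedIds: List<String>, chunks: List<Map<String, Any?>>): List<String> =
    selectedIds.zip(chunks).mapNotNull { (id, chunk) ->
        definitions.find { it.id == id }?.decode?.invoke(chunk)
    }
```

Because the response array comes back in the same order as the queries, zipping the selections with the chunks pairs each definition with its own data.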

Building the UI

Similar to iOS, we support three widget sizes for versions prior to Android 12 and follow the same rules for cell layout, except for the small widget. The small widget on Android supports a single metric (compared to the two on iOS) and the smallest widget size on Android is a 2x1 grid. We quickly found that only a single metric would fit in this space, so this design differs slightly between the platforms.

Unlike on iOS with its SwiftUI previews, we were limited to XML previews and running the widget on an emulator or device. We’re also building widgets dynamically, so even XML previews were of limited use when we wanted to see an entire widget preview. Widgets are currently on the 2022 Jetpack Compose roadmap, so this is likely to change soon with Jetpack Compose previews.

With the addition of dynamic layouts in Android 12, we created five additional sizes to support each size in between the original three. These new sizes are unique to Android. This also led to using grid sizes as part of our naming convention as you can see in our WidgetLayout enum below.

For the structure of our widget, we used an enum that acts as a blueprint to map the appropriate layout file to an area of our widget. This is particularly useful when we want to add a new widget because we simply need to add a new enum configuration.
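The original enum isn’t reproduced here, but a blueprint keyed by grid size might look roughly like this (entry names, slot counts, and layout names are hypothetical placeholders for R.layout references):

```kotlin
// An enum-as-blueprint: each entry maps a grid size to a layout and how many
// metric cells it can hold. Adding a new widget size is just a new entry.
enum class WidgetLayout(val layoutName: String, val metricSlots: Int) {
    SIZE_2X1("widget_2x1", 1),
    SIZE_2X2("widget_2x2", 2),
    SIZE_4X1("widget_4x1", 2),
    SIZE_4X2("widget_4x2", 4),
    SIZE_4X4("widget_4x4", 8);

    companion object {
        // Resolve the blueprint for a given grid size, or null if unsupported.
        fun forGrid(columns: Int, rows: Int): WidgetLayout? =
            values().find { it.layoutName == "widget_${columns}x$rows" }
    }
}
```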

To build the widgets dynamically, we read our configuration from shared preferences and provide that information to the RemoteViews API.

If you’re familiar with the RemoteViews API, you may notice the updateView() method, which is not a default RemoteViews method. We created this extension method as a result of an issue we ran into while building our widget layout in this dynamic manner. When a widget updates, the new remote views get appended to the existing ones. As you can probably guess, the widget didn’t look so great. Even worse, more remote views get appended on each subsequent update. We found that combining the two RemoteViews API methods removeAllViews() and addView() solved this problem.
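Since RemoteViews is an Android-only API, here’s the append problem and the fix modeled with a stand-in class; the real extension combines the same two RemoteViews calls, removeAllViews() and addView():

```kotlin
// Stand-in for RemoteViews: addView() appends child views to a container,
// which is exactly why repeated widget updates stacked duplicate views.
class FakeRemoteViews {
    val children = mutableMapOf<String, MutableList<String>>()
    fun addView(containerId: String, view: String) {
        children.getOrPut(containerId) { mutableListOf() }.add(view)
    }
    fun removeAllViews(containerId: String) {
        children[containerId]?.clear()
    }
}

// Equivalent of the updateView() extension: clear stale children, then append
// the freshly built view, so each update replaces rather than accumulates.
fun FakeRemoteViews.updateView(containerId: String, view: String) {
    removeAllViews(containerId)
    addView(containerId, view)
}
```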

Once we build our remote views, we then pass the parent remote view to the AppWidgetProvider updateAppWidget() method to display the desired layout.

It’s worth noting that we attempted to use partiallyUpdateAppWidget() to stop our remote views from appending to each other, but encountered the same issue.

Using Dynamic Dates

One important piece of information on our widget is the last updated timestamp. It helps remove confusion by letting merchants quickly see how fresh the data they’re looking at is. If the data is quite stale (say you went to the cottage for the weekend and missed a few updates) and there wasn’t a displayed timestamp, you’d assume the data you’re looking at is up to the second, which can cause unnecessary confusion for our merchants. The solution was to make sure we communicate to merchants when the last update was made.

In our previous design, we only had small widgets, and they displayed only one metric. This resulted in a long piece of text that, on smaller devices, would sometimes wrap over two lines. That was fine when space was abundant in our older design, but not in our new data-rich designs. We explored how we could best work with timestamps on widgets, and the most promising solution was relative time: instead of a static value such as “as of 3:30pm” like our previous iteration, we’d show a dynamic date like “1 min, 3 sec ago.”

One thing to remember is that even though the widget is visible, we have a limited number of updates we can trigger; otherwise, it would consume a lot of unnecessary resources on the device. We knew we couldn’t keep triggering updates on the widget as often as we wanted. Android has a tool for this in TextClock, but it has no support for relative time, so it wasn’t useful in our case. We also explored using alarms, but potentially updating every minute would consume too much battery.

One big takeaway from these explorations was to always test your widgets under different conditions, especially low battery or poor network. These surfaces are much more restrictive than general applications, and the OS is much more likely to ignore updates.

We eventually decided that we wouldn’t use relative time and kept the widget’s refresh time as a timestamp. This way we have full control over things like date formatting and styling.

Adding Configuration

Our new widgets have a great deal of configuration options, allowing our merchants to choose exactly what they care about. For each widget size, the merchant can select the store, a certain number of metrics and a date range. This is the only part of the widget that doesn’t use RemoteViews, so there aren’t any restrictions on what type of View you may want to use. We share information between the configuration and the widget via shared preferences.

Insights widget configuration
Insights widget configuration

Working with Charts and Images

Android widgets are limited to RemoteViews as their building blocks and are very restrictive in terms of the view types supported. If you need to support anything outside of basic text and images, you need to be a bit creative.

Our widgets support both sparkline and spark bar charts built using the MPAndroidChart library. We already have these charts configured and styled in our main application, so the reuse here was perfect, except that we can’t use any custom Views in our widgets. Luckily, this library creates its charts by drawing to a canvas, so we simply export the charts as a bitmap to an image view.

Once we were able to measure the widget, we constructed a chart of the required size, created a bitmap, and set it on the RemoteViews ImageView. One small thing to remember with this approach: if you want transparent backgrounds, you’ll have to use ARGB_8888 as the bitmap config. This simple bitmap-to-ImageView approach allowed us to handle any custom drawing we needed to do.

Eliminating Flickering

One minor but annoying issue we encountered throughout the project is what we like to call “widget flickering.” Flickering is a side effect of the widget updating its data: between updates, Android uses the initialLayout from the widget’s configuration as a placeholder while the widget fetches its data and builds its new RemoteViews. We found that it wasn’t possible to eliminate this behavior, so we implemented a couple of strategies to reduce the frequency and duration of the flicker.

The first strategy applies when a merchant first places a widget on the home screen, and it reduces the frequency of flickering. In the earlier section “Making the Widgets Antifragile,” we shared our short-lived cache. The cache comes into play because the OS triggers multiple updates for a widget as soon as it’s placed on the home screen; we’d sometimes see three or four quick flickers caused by these updates. After the widget gets its data for the first time, we prevent any additional updates from happening for the first 60 seconds, reducing the frequency of flickering.

The second strategy reduces the duration of a flicker (how long the initialLayout is displayed). We store the widget’s configuration in shared preferences each time it’s updated, so we always have a snapshot of what widget information is currently displayed. When the onUpdate() method is called, we invoke a renderEarlyFromCache() method as early as possible. This method builds the widget from shared preferences, and we provide this cached widget as a placeholder until the new data has arrived.
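A minimal sketch of the second strategy, under stated assumptions: `fetchFreshDataAndRender()` and `buildRemoteViews()` stand in for our actual (unshown) helpers, and the preferences file name and key format are illustrative.

```kotlin
import android.appwidget.AppWidgetManager
import android.appwidget.AppWidgetProvider
import android.content.Context

// Sketch (helper names illustrative): render the last-known snapshot
// from shared preferences immediately, then refresh over the network.
class InsightsWidgetProvider : AppWidgetProvider() {

    override fun onUpdate(context: Context, manager: AppWidgetManager, appWidgetIds: IntArray) {
        appWidgetIds.forEach { id ->
            renderEarlyFromCache(context, manager, id)
            fetchFreshDataAndRender(context, manager, id) // assumed async helper
        }
    }

    private fun renderEarlyFromCache(context: Context, manager: AppWidgetManager, id: Int) {
        val prefs = context.getSharedPreferences("widget_snapshots", Context.MODE_PRIVATE)
        // No snapshot yet (first render): fall back to initialLayout.
        val snapshot = prefs.getString("widget_$id", null) ?: return
        manager.updateAppWidget(id, buildRemoteViews(context, snapshot)) // assumed helper
    }
}
```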

Gathering Analytics

Largest Insights widget in light mode
Largest widget in light mode

Since our first widgets were developed, we’ve added strategic analytics in key areas so that we could understand how merchants were using the functionality and improve on it. The data team built dashboards displaying detailed views of how many widgets were installed, the most popular metrics, and the most popular sizes.

Most of the data used to build the dashboards came from analytics events fired through the widgets and the Shopify app.

For these new widgets, we wanted to better understand adoption and retention of widgets, so we needed to capture how users are configuring their widgets over time and which ones are being added or removed.

Detecting Addition and Removal of Widgets 

Unlike on iOS, capturing this data on Android is straightforward. To capture when a merchant adds a widget, we send our analytics event when the configuration is saved. When a widget is removed, the built-in onDeleted() method gives us the ID of the removed widget. We can then look up our widget information in shared preferences and send our event prior to permanently deleting the widget information from the device.
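The removal path can be sketched like this (a hedged illustration: the `Analytics.track` helper, event name, and preference keys are all hypothetical stand-ins, not our real API):

```kotlin
import android.content.Context

// Sketch (event name and helpers illustrative): report widget removal
// before discarding its stored configuration.
override fun onDeleted(context: Context, appWidgetIds: IntArray) {
    val prefs = context.getSharedPreferences("widget_snapshots", Context.MODE_PRIVATE)
    appWidgetIds.forEach { id ->
        prefs.getString("widget_$id", null)?.let { config ->
            Analytics.track("widget_removed", config) // hypothetical analytics helper
        }
        // Only now drop the stored configuration for this widget id.
        prefs.edit().remove("widget_$id").apply()
    }
    super.onDeleted(context, appWidgetIds)
}
```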

Supporting Android 12

When we started development of our new widgets, our application was targeting Android 11. We knew we’d be targeting Android 12 eventually, but we didn’t want the upgrade to block the release. We decided to implement Android 12-specific features once our application targeted the newer version, which led to an unforeseen issue with widget selection during the upgrade process.

Our approach to widget selection in previous versions was to display each available size as an option. With the introduction of responsive layouts, we no longer needed to display each size as its own option; merchants can now pick a single widget and resize it to their desired layout. In previous versions, merchants could select a small, medium, or large widget. On Android 12 and up, merchants select a single widget that can be resized to the same layouts as small, medium, and large, as well as several other layouts that fall in between. This pattern follows what Google does with the large weather widget included on devices, as well as an example in their documentation. We disabled the medium and small widgets on Android 12 by adding a flag to our AndroidManifest and setting that value in attrs.xml for each version:
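The mechanism can be sketched as below. This is a hedged reconstruction: the resource name `show_legacy_widget_sizes` and the receiver class name are illustrative, not the actual values from our project.

```xml
<!-- Sketch (names illustrative). AndroidManifest.xml: the receiver for a
     legacy size is enabled or disabled via a per-version boolean resource. -->
<receiver
    android:name=".widgets.SmallInsightsWidgetProvider"
    android:enabled="@bool/show_legacy_widget_sizes"
    android:exported="false">
    <!-- intent filters and widget provider metadata omitted -->
</receiver>

<!-- values/attrs.xml (pre-Android 12): -->
<bool name="show_legacy_widget_sizes">true</bool>

<!-- values-v31/attrs.xml (Android 12 and up): -->
<bool name="show_legacy_widget_sizes">false</bool>
```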

The approach above behaves as expected: the medium and small widgets are no longer available from the picker. However, if a merchant was already on Android 12 and had added a medium or small widget before our widget update, those widgets were removed from their home screen. This could easily be seen as a bug and reduce confidence in the feature. In retrospect, we could have prototyped what the upgrade would have looked like to a merchant who was already on Android 12.

Allowing only the large widget to be available was a data-driven decision. By tracking widget usage at launch, we saw that the large widget was the most popular and removing the other two would have the least impact on current widget merchants.

Building New Layouts

We encountered an error when building the new layouts that fit between the original small, medium, and large widgets.

After researching the error, we found we were exceeding the Binder transaction buffer. However, the buffer’s size is 1 MB, and the error reported 0.66 MB, which isn’t over the documented buffer size. The error appears to have stumped a lot of developers. After experimenting with ways to get the size down, we found that we could either drop a couple of entire layouts or remove support for a fourth and fifth row of the small metric. We decided on the latter, which is why our 2x3 widget only has three rows of data when it has room for five.

Rethinking the Configuration Screen

Now that merchants have only one widget to choose from, we had to rethink what our configuration screen would look like. Without our three fixed sizes, we could no longer display a fixed number of metrics in our configuration.

Our only choice was to display the maximum number of metrics available for the largest size (which is seven at the time of this writing). Not only did this make the most sense to us from a UX perspective, but we also had to do it this way because of how the new responsive layouts work. Android has to know all of the possible layouts ahead of time. Even if a user shrinks their widget to a size that displays a single metric, Android has to know what the other six are, so it can be resized to our largest layout without any hiccups.

We also updated the description that’s displayed at the top of the configuration screen that explains this behavior.

Capturing More Analytics

On iOS, we capture analytics data when a merchant reconfigures a widget to gain insights into usage patterns. Reconfiguration on Android is only possible from version 12, and due to the limitations of the AppWidgetProvider’s onAppWidgetOptionsChanged() method, we were unable to capture this data on Android.

I’ll share more information about building our layouts in order to give context to our problem. Setting breakpoints for new dynamic layouts works very well across all devices. Google recommends creating a mapping of your breakpoints to the desired remote view layout. To build on a previous example where we showed the buildWidgetRemoteView() method, we used this method again as part of our breakpoint mapping. This approach allows us to reliably map our breakpoints to the desired widget layout.

When reconfiguring or resizing a widget, the onAppWidgetOptionsChanged() method is called. This is where we’d want to capture our analytics data about what had changed. Unfortunately, this view mapping doesn’t exist there. We have access to width and height values measured in dp, which initially appeared useful. At first, we felt we could translate these measurements into something meaningful and map them back to our layout sizes. After testing on a couple of devices, we realized that the approach was unreliable and would lead to a large volume of bad analytics data. Without confidently knowing what size we’re coming from or going to, we decided to omit this particular analytics event on Android. We hope to bring this to Google’s attention and get it included in a future release.

Shipping New Widgets

Since we already had a pair of existing widgets, we had to decide how to transition to the new widgets, which would be replacing the existing implementation.

We didn’t find much documentation around migrating widgets. The docs only provided a way to enhance your widget, which means adding the new features of Android 12 to something you already had. This wasn’t applicable to us since our existing widgets were so different from the ones we were building.

The major issue that we couldn’t get around was related to the sizing strategies of our existing and new widgets. The existing widgets used a fixed width and height so they’d always be square. Our new widgets take whatever space is available. There wasn’t a way to guarantee that the new widget would fit in the available space that the existing widget had occupied. If the existing widget was the same size as the new one, it would have been worth exploring further.

The initial plan we had hoped for was to transform one of our widgets into the new widget while removing the other one. Given the above, this strategy wouldn’t work.

The compromise we came to, so as not to completely remove all of a merchant’s widgets overnight, was to deprecate the old widgets at the same time we released the new one. To deprecate, we updated our old widgets’ UI to display a message informing the merchant that the widget is no longer supported and that they must add the new one.

Screen displaying the notice: This widget is no longer active. Add the new Shopify Insights widget for an improved view of your data. Learn more.
Widget deprecation message

There’s no way to add a new widget programmatically or to bring the merchant to the widget picker by tapping on the old widget. We added some communication to help ease the transition: updating our help center docs to include information on how to use widgets, pointing our old widgets to open those docs, and giving lots of time before removing the deprecation message. In the end, it wasn’t an ideal situation, but we came away having learned about the pitfalls within the two ecosystems.

What’s Next

As we continue to learn about how merchants use our new generation of widgets and Android 12 features, we’ll continue to home in on the best experience across both our platforms. This also opens the way for other teams at Shopify to build on what we’ve started and create more widgets to help merchants.

On the topic of mobile-only platforms, this leads us into getting up to speed on Wear OS. Our watchOS app is about to get a refresh with the addition of WidgetKit; it feels like a great time to finally give our Android merchants watch support too!

Matt Bowen is a mobile developer on the Core Admin Experience team, located in West Warwick, RI. His personal hobbies include exercising and watching the Boston Celtics and New England Patriots.

James Lockhart is a Staff Developer based in Ottawa, Canada. He has experienced mobile development over the past 10+ years, from PhoneGap to native iOS/Android and now React Native. He is an avid baker and cook when not contributing to the Shopify admin app.

Cecilia Hunka is a developer on the Core Admin Experience team at Shopify. Based in Toronto, Canada, she loves live music and jigsaw puzzles.

Carlos Pereira is a Senior Developer based in Montreal, Canada, with more than a decade of experience building native iOS applications in Objective-C, Swift and now React Native. Currently contributing to the Shopify admin app.

Wherever you are, your next journey starts here! If building systems from the ground up to solve real-world problems interests you, our Engineering blog has stories about other challenges we have encountered. Intrigued? Visit our Engineering career page to find out about our open positions and learn about Digital by Design.

Lessons From Building iOS Widgets

By Carlos Pereira, James Lockhart, and Cecilia Hunka

When the iOS 14 beta was originally announced, we knew we needed to take advantage of the new widget features and get something to our merchants. The new widgets looked awesome and could really give merchants a way to see their shop’s data at a glance without needing to open our app.

Fast forward a couple of years, and we now have lots of feedback on that design. We knew merchants were using the widgets, but they needed more: the design only provided two metrics, and the widgets took up a lot of space. This experience prompted us to start a new project to upgrade our original design to better fit our merchants’ needs.

Why Widgets Are Important to Shopify

Our widgets mainly focus on analytics. Analytics can help merchants understand how they’re doing and quickly gain insights to make better decisions about their business. Monitoring metrics is a daily activity for a lot of our merchants, and on mobile we have the opportunity to give them a faster way to access this data through widgets. Widgets provide “at a glance” information about a shop and give merchants a unique avenue to quickly get a pulse on their shops that they wouldn’t find on desktop.

A screenshot showing the add widget screen for Insights on iOS
Add Insights widget

After gathering feedback and continuously looking for opportunities to enhance our widget capabilities, we’re at our third iteration, and we’ll share with you how we approached building widgets and some of the challenges we faced.

Why We Didn’t Use React Native

A couple of years ago, Shopify decided to go all in on React Native. New development was done in React Native, and we began migrating some apps to the new stack, including our flagship admin app, where we were building our widgets. This posed the question: should we write the widgets in React Native?

After doing some investigation, we quickly hit some roadblocks: app extensions are limited in terms of memory; WidgetKit’s architecture is highly optimized to work with SwiftUI, as the view hierarchy is serialized to disk; and there’s also, at this time, no official support in the React Native community for widgets.

Shopify believes in using the right tool for the job, and we believed that native development with SwiftUI was the best choice in this case.

Building the Widgets

When building out our architecture for widgets, we wanted to create a consistent experience on both iOS and Android while preserving platform idioms where it made sense. Below we’ll go over our experience and strategies building the widgets, pointing out some of the more difficult challenges we faced. Our aim is to shed some light around these less talked about surfaces, give some inspiration for your projects, and hopefully, save time when it comes to implementing your widgets.

Fetching Data

Some types of widgets have data that changes less frequently (for example, reminders) and some that can be forecasted for the entire day (for example, calendar and weather). In our case, merchants need up-to-date metrics about their business, so we need to show data that’s as fresh as possible. Timeliness is crucial for our widget. Delays in data can cause confusion or, even worse, delay information that could inform a business decision. For example, let’s say you watch the Stocks app. You’d expect the app and its corresponding widget data to be as up to date as possible; if the data is multiple hours stale, you could miss valuable information for making decisions, or miss an important drop or rise in price. Likewise, our merchants need information that’s as up to date as we can provide to run their business.

Fetching Data in the App

Widgets can be kept up to date with relevant and timely information by using data available locally or fetching it from a server. The server fetching can be initiated by the widget itself or by the host app. In our case, since the app doesn’t share the same information as the widget, we decided it made more sense to fetch it from the widget.

We still consider moving data fetching to the app once we start sharing similar data between widgets and the app. This architecture could simplify the handling of authentication, state management, updating data, and caching in our widget since only one process will have this job rather than two separate processes. It’s worth noting that the widget can access code from the main app, but they can only communicate data through keychain and shared user defaults as widgets run on separate processes. Sharing the data fetching, however, comes with an added complexity of having a background process pushing or making data available to the widgets, since widgets must remain functional even if the app isn’t in the foreground or background. For now, we’re happy with the current solution: the widgets fetch data independently from the app while sharing the session management code and tokens.

A flow diagram highlighting widgets fetch data independently from the app while sharing the session management code and tokens
Current solution where widgets fetch data independently

Querying Business Analytics Data with Reportify and ShopifyQL

The business data and visualizations displayed in the widgets are powered by Reportify, an in-house service that exposes data through a set of schemas queried via ShopifyQL, Shopify’s commerce data querying language. It looks very similar to SQL but is designed around data for commerce. For example, to fetch a shop’s total sales for the day:
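As a hedged illustration of ShopifyQL’s SQL-like shape (the exact table, field, and range names here are illustrative and may differ from the real schema):

```sql
FROM sales SHOW total_sales DURING today
```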

Making Our Widgets Antifragile

The iOS widget architecture is built so that updates are mindful of battery usage and are budgeted by the system. In the same way, our widgets must also be mindful of saving bandwidth when fetching data over a network. While developing our second iteration, we came across a peculiar problem that was exacerbated by our specific use case.

Since we need data to be fresh, we always pull new data from our backend on every update. Each update is approximately 15 minutes apart to avoid having our widgets stop updating (you can read about why on Apple’s developer site). We found that iOS calls the update methods, getTimeline() and getSnapshot(), more than once in an update cycle. In widgets like Calendar, these extra calls come without much extra cost, as the data is stored locally. In our app, however, this was triggering two to five extra network calls for the same widget with the same data in quick succession.

We also noticed these calls were causing a seemingly unrelated kick out issue affecting the app. Each widget runs on a different process than the main application, and all widgets share the keychain. Once the app requests data from the API, it checks whether it has an authenticated token in the keychain. If that token is stale, our system pauses updates, refreshes the token, and continues network requests. In the case of our widgets, each update call was creating another workflow that could need a token refresh. When we only had a single widget or update flow, it worked great, and even four to five updates would usually work pretty well. Eventually, however, one of these network calls would come out of order, and an invalid token would get saved. On the next update, we’d have no way to retrieve data or request a new token, resulting in a session kick out. This was a great find, as the issue had been causing a lot of frustration for our affected merchants and for ourselves; we could never really put our finger on why these things would, every now and again, just log us out.

In order to correct the unnecessary roundtrips, we built a simple short-lived cache:

  1. The system asks our widget to provide new data.
  2. We first look into the local cache using a key specific to that widget. On iOS, our key is produced from the widget’s configuration, as there are no unique identifiers provided. We also take configuration such as locale into account so that a language change forces an update.
  3. If there’s data, and that data was set less than one minute ago, we return it and avoid making a network request.
  4. Otherwise, we fetch the data as normal and store it in the cache with a timestamp.

A flow diagram highlighting the steps of the cache
The simple short-lived cache flow

With this solution, we reduced unused network calls and system load, avoided collecting incorrect analytics, and fixed a long-running bug with our app kick outs!
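The steps above can be sketched as a small generic cache. This is a hedged sketch, not our production code: the type and method names are illustrative, and the key is assumed to be derived from the widget configuration (shop, metrics, date range, locale).

```swift
import Foundation

// Sketch (names illustrative): a short-lived cache keyed off the widget's
// configuration, used to collapse the burst of getTimeline/getSnapshot
// calls into a single network request.
final class ShortLivedCache<Value> {
    private struct Entry {
        let value: Value
        let storedAt: Date
    }

    private var entries: [String: Entry] = [:]
    private let timeToLive: TimeInterval

    init(timeToLive: TimeInterval = 60) {
        self.timeToLive = timeToLive
    }

    func value(forKey key: String, loader: () async throws -> Value) async rethrows -> Value {
        // Fresh enough: skip the network round trip entirely.
        if let entry = entries[key], Date().timeIntervalSince(entry.storedAt) < timeToLive {
            return entry.value
        }
        let fresh = try await loader()
        entries[key] = Entry(value: fresh, storedAt: Date())
        return fresh
    }
}
```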

Implementing Decoder Strategy with Dynamic Selections

When fetching the data from the analytics REST service, each widget can be configured with two to seven metrics from a total of 12. This set should grow in the future as new metrics become available. Our current metrics are all time-based and have a similar structure.

That doesn’t mean the structure of future metrics won’t change. For example, what about a metric whose data isn’t mapped over a time range, like orders to fulfill, which doesn’t contain any historical information?

The merchant is also able to configure the order in which the metrics appear, which shop the data comes from (if they have more than one shop), and which date range the data represents: today, last 7 days, or last 30 days.

We had to implement a data fetching and decoding mechanism that:

• only fetches the data the merchant requested, in order to avoid asking for unneeded information
• supports our current set of metrics while remaining flexible enough to add future metrics with different shapes
• supports different date ranges for the data.

A simplified version of the solution is shown below. First, we create a struct to represent the query to the analytics service (Reportify).

Then, we create a class to represent the decodable response. Right now it has a fixed structure (value, comparison, and chart values), but in the future we can use an enum or different subclasses to decode different shapes.

Next, we create a response wrapper that attempts to decode the metrics based on a list of metric types passed to it. Each metric has its own configuration, so we know which class is used to read the values.

Finally, when the widget Timeline Provider asks for new data, we fetch the data for the current metrics and decode the response.
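The three pieces above can be sketched together as follows. This is a hedged, simplified reconstruction: the type names, field names, and metric keys are illustrative stand-ins for our actual models.

```swift
import Foundation

// Sketch (types illustrative): a query struct, a decodable metric
// response, and a wrapper that decodes only the requested metrics.
struct MetricsQuery: Encodable {
    let shopID: String
    let metricTypes: [String]   // e.g. ["total_sales", "sessions"]
    let dateRange: String       // "today", "last_7_days", "last_30_days"
}

final class MetricResponse: Decodable {
    let value: Double
    let comparison: Double
    let chartValues: [Double]
}

struct MetricsResponseWrapper {
    let metrics: [String: MetricResponse]

    // Decode each requested metric by its type key; metrics that are
    // missing or fail to decode are skipped rather than failing the
    // whole payload, which leaves room for future metric shapes.
    init(json: [String: Any], requestedTypes: [String]) throws {
        var decoded: [String: MetricResponse] = [:]
        let decoder = JSONDecoder()
        for type in requestedTypes {
            guard let payload = json[type],
                  JSONSerialization.isValidJSONObject(payload) else { continue }
            let data = try JSONSerialization.data(withJSONObject: payload)
            decoded[type] = try? decoder.decode(MetricResponse.self, from: data)
        }
        self.metrics = decoded
    }
}
```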

Building the UI

We wanted to support the three widget sizes: small, medium, and large. From the start, we wanted a single View to support all sizes, in an attempt to minimize UI discrepancies and keep the code easy to maintain.

We started by identifying the common structure and creating components. We ended up with a Metric Cell component that has three variations:

A metric cell from a widget
A metric cell
A metric cell from a widget
A metric cell with a sparkline
A metric cell from a widget
A metric cell with a bar chart

All three variations consist of a metric name and value, a chart, and a comparison. As the widget containers become bigger, we show the merchant more data: each view size contains more metrics, and the largest widget contains a full-width chart for the first chosen metric. The comparison indicator also shifts from the bottom to the right in this variation.

On the large widget, the first chosen metric is shown as a full-width cell with a bar chart displaying the data more clearly; we call it the primary cell. We added a structure to indicate whether a cell is going to be used as primary or not. Besides the primary flag, our component doesn’t have any context about the widget size, so we use the chart data as an indicator of whether to render a cell as primary. This paradigm fits very well with SwiftUI.

A simplified version of the actual Cell View:
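A hedged sketch of what such a cell might look like (the `Metric` model and `SparklineView` are hypothetical stand-ins for our real types):

```swift
import SwiftUI

// Stand-in model (illustrative); the real struct holds more data.
struct Metric: Identifiable {
    let id: String
    let name: String
    let value: String
    let comparison: String
    let chartData: [Double]?
}

// Sketch: a cell renders name, value, comparison, and an optional chart.
// The primary flag drives the larger, full-width presentation.
struct MetricCell: View {
    let metric: Metric
    let isPrimary: Bool

    var body: some View {
        VStack(alignment: .leading, spacing: 4) {
            Text(metric.name).font(.caption).foregroundColor(.secondary)
            Text(metric.value).font(isPrimary ? .title2 : .headline)
            if let chartData = metric.chartData {
                SparklineView(data: chartData) // hypothetical chart view
                    .frame(height: isPrimary ? 40 : 16)
            }
            Text(metric.comparison).font(.caption2)
        }
    }
}
```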

After building our cells, we needed a structure to render them in a grid according to the size and metrics chosen by the merchant. This component also has no context of the widget size, so our layout decisions are based mainly on how many metrics we’re receiving. In this example, we’ll refer to the View as a WidgetView.

The WidgetView is initialized with a WidgetState, a struct that holds most of the widget data, such as shop information, the chosen metrics and their data, and a last-updated string (which represents the last time the widget was updated).

To be able to make layout decisions based on the widget’s characteristics, we created an OptionSet called LayoutOption. This is passed as an array to the WidgetView.

Layout options:
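A hedged sketch of what such an OptionSet might look like (the specific option names are illustrative, not our actual ones):

```swift
// Sketch (option names illustrative): layout characteristics, rather
// than widget families, drive the view's decisions.
struct LayoutOption: OptionSet {
    let rawValue: Int

    static let primaryCell        = LayoutOption(rawValue: 1 << 0)
    static let showSparklines     = LayoutOption(rawValue: 1 << 1)
    static let trailingComparison = LayoutOption(rawValue: 1 << 2)

    // Convenience combinations roughly matching the three sizes.
    static let small: LayoutOption  = []
    static let medium: LayoutOption = [.showSparklines]
    static let large: LayoutOption  = [.primaryCell, .showSparklines, .trailingComparison]
}
```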

That helped us avoid tying this component to widget families; tying it to layout characteristics instead makes the component very reusable in other contexts.

The WidgetView layout is built mainly using a LazyVGrid component:

A simplified version of the actual View:
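A hedged sketch under stated assumptions: the stand-in `Metric`, `WidgetState`, and `LayoutOption` models below are illustrative simplifications of our real types, and the cell views are inlined rather than using a separate component.

```swift
import SwiftUI

// Stand-in models (illustrative).
struct Metric: Identifiable { let id: String; let name: String; let value: String }
struct WidgetState { let metrics: [Metric]; let lastUpdated: String }
struct LayoutOption: OptionSet {
    let rawValue: Int
    static let primaryCell = LayoutOption(rawValue: 1 << 0)
}

// Sketch: the grid lays out however many metrics arrive, promoting the
// first to a full-width primary cell when the layout options ask for it.
struct WidgetView: View {
    let state: WidgetState
    let layoutOptions: LayoutOption

    private let columns = [GridItem(.flexible()), GridItem(.flexible())]

    var body: some View {
        VStack(alignment: .leading, spacing: 8) {
            if layoutOptions.contains(.primaryCell), let first = state.metrics.first {
                // Full-width primary cell for the first chosen metric.
                VStack(alignment: .leading) {
                    Text(first.name).font(.caption)
                    Text(first.value).font(.title2)
                }
            }
            LazyVGrid(columns: columns, alignment: .leading, spacing: 8) {
                ForEach(secondaryMetrics) { metric in
                    VStack(alignment: .leading) {
                        Text(metric.name).font(.caption)
                        Text(metric.value).font(.headline)
                    }
                }
            }
            Text(state.lastUpdated).font(.caption2).foregroundColor(.secondary)
        }
    }

    private var secondaryMetrics: [Metric] {
        layoutOptions.contains(.primaryCell) ? Array(state.metrics.dropFirst()) : state.metrics
    }
}
```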

Adding Dynamic Dates

One important piece of information on our widget is the last-updated timestamp. It helps remove confusion by allowing merchants to quickly know how fresh the data they’re looking at is. Since iOS has an approximate update time with many variables, coupled with data connectivity, it’s very possible the data could be over 15 minutes old. If the data is quite stale (say you went to the cottage for the weekend and missed a few updates) and there was no update string, you’d assume the data you’re looking at is up to the second. This can cause unnecessary confusion for our merchants. The solution was to ensure there’s some communication to the merchant about when the last update happened.

In our previous design, we only had small widgets, and they were able to display only one metric. This information resulted in a long string that, on smaller devices, would sometimes wrap and show over two lines. This was fine when space was abundant in our older design, but not in our new data-rich designs. We explored how we could best work with timestamps on widgets, and the most promising solution was to use relative time. Instead of a static value such as “as of 3:30pm” like our previous iteration, we’d have a dynamic date that would look like “1 min, 3 sec ago.”

One thing to remember is that even though the widget is visible, we have a limited number of updates we can trigger; otherwise, it would consume a lot of unnecessary resources on the merchant’s device. We knew we couldn’t keep triggering updates on the widget as often as we wanted (nor would it be allowed), but iOS has ways to deal with this. Apple released support for dynamic text on widgets during our development, which allows using timers on your widgets without requiring updates. We simply need to pass a style to a Text component, and it automatically keeps everything up to date:

Text("\(now, style: .relative) ago")

It was good, but there are no options to customize the relative style. Being able to customize it was an important point for us, as the supported style doesn’t fit well with our widget layout; one of our biggest constraints with widgets is space, as we always need to think about the smallest widget possible. In the end, we decided not to move forward with the relative time approach and kept a reduced version of our previous timestamp.

Adding Configuration

Our new widgets have a great amount of configuration, allowing merchants to choose exactly what they care about. For each widget size, the merchant can select the store, a certain number of metrics, and a date range. On iOS, widgets are configured through the SiriKit Intents API. We faced some challenges with the widget configuration, but fortunately, all of them had workarounds that fit our use cases.

Insights widget configuration
Insights widget configuration

It’s Not Possible to Deselect a Metric

When defining a field that has multiple values provided dynamically, we can limit the number of options per widget family. This was important for us, since each widget size supports a different number of metrics. However, the current widget configuration UI on iOS only allows selecting a value, not deselecting it. So, once we selected a metric, we couldn’t remove it, only update the selection. But what if the merchant were only interested in one metric on the small widget? We solved this with a small design change: providing “None” as an option. If the merchant chooses this option, it’s ignored and shown as an empty state.

It’s Not Possible to Validate the User Selections

With the addition of “None” and the way intents are designed, it was possible to select “None” everywhere and end up with a widget with no metrics. In addition, it was possible to select the same metric twice. We would have liked to validate the user’s selection, but the Intents API didn’t support it. The solution was to embrace the fact that a widget can be empty and show an empty state. Duplicates were filtered out, so any repeated metric choice was changed to “None” before we sent any network requests.

The First Calls to getTimeline and getSnapshot Don’t Respect the Maximum Metric Count

For intent configurations provided dynamically, we must provide default values in the IntentHandler. In our case, the metrics list varies per widget family, and in the IntentHandler it’s not possible to query which widget family is being used, so we had to return at least as many metrics as the largest widget supports (seven).

However, even though we limit the number of metrics per family, the first getTimeline and getSnapshot calls in the Timeline Provider were filling the configuration object with all of the default metrics, so a small widget would have seven metrics instead of two!

We ended up adding some cleanup code at the beginning of the Timeline Provider methods that trims the list depending on the expected number of metrics.
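A hedged sketch of that cleanup step. The small (two) and large (seven) limits come from the text above; the medium limit and the `Metric` type are illustrative assumptions.

```swift
import SwiftUI
import WidgetKit

// Sketch: the first getTimeline/getSnapshot calls can hand us all seven
// default metrics regardless of family, so we trim to the family's limit.
func trimmedMetrics<Metric>(from configured: [Metric], for family: WidgetFamily) -> [Metric] {
    let limit: Int
    switch family {
    case .systemSmall:  limit = 2  // per the text above
    case .systemMedium: limit = 4  // illustrative assumption
    default:            limit = 7  // the largest widget's maximum
    }
    return Array(configured.prefix(limit))
}
```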

      Optimizing Testing

      Automated tests are a fundamental part of Shopify’s development process. In the Shopify app, we have a good amount of unit and snapshot tests. The old widgets on Android had good test coverage already, and we built on the existing infrastructure. On iOS, however, there were no tests since it’s currently not possible to add test targets against a widget extension on Xcode.

      Given this would be a complex project and we didn’t want to compromise on quality, we investigated possible solutions for it.

      The simplest solution would be to add each file on both the app and in the widget extension targets, then we could unit test it in the app side in our standard test target. We decided not to do this since we would always need to add a file to both targets, and it would bloat the Shopify app unnecessarily.

      We chose to create a separate module (a framework in our case) and move all testable code there. Then we could create unit and snapshot tests for this module.

      We ended up moving most of the code, like views and business logic, to this new module (WidgetCore), while the extension only had WidgetKit specific code and configuration like Timeline provider, widget bundle, and intent definition generated files.

      Given our code in the Shopify app is based on UIKit, we did have to update our in-house snapshot testing framework to support SwiftUI views. We were very happy with the results. We ended up achieving a high test coverage, and the tests flagged many regressions during development.

      Fast SwiftUI Previews 

      The Shopify app is a big application, and it takes a while to build. Given the widget extension is based on our main app target, it took a long time to prepare the SwiftUI previews. This caused frustration during development. It also removed one of the biggest benefits of SwiftUI—our ability to iterate quickly with Previews and the fast feedback cycle during UI development.

      One idea we had was to create a module that didn’t rely on our main app target. We created one called WidgetCore where we put a lot of our reusable Views and business logic. It was fast to build and could also render SwiftUI previews. The one caveat is, since it wasn’t a widget extension target, we couldn’t leverage the WidgetPreviewContext API to render views on a device. It meant we needed to load up the extension to ensure the designs and changes were always working as expected on all sizes and modes (light and dark).

      To solve this problem, we created a PreviewLayout extension. This had all the widget sizes based on the Apple documentation, and we were able to use it in a similar way:

      Our PreviewLayout extension would be used on all of our widget related views in our WidgetCore module to emulate the sizes in previews:
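A sketch of how such an extension and its usage might look (the point sizes are illustrative values for one device; Apple's documentation lists the exact sizes per device, and MetricView is a hypothetical WidgetCore view, not Shopify's actual code):

```swift
import SwiftUI

// Sketch: a PreviewLayout extension listing the widget sizes. The point
// sizes below are illustrative; the real values vary by device.
extension PreviewLayout {
    static let widgetSizes: [PreviewLayout] = [
        .fixed(width: 169, height: 169), // small
        .fixed(width: 360, height: 169), // medium
        .fixed(width: 360, height: 376), // large
    ]
}

// MetricView stands in for a view from the WidgetCore module.
struct MetricView_Previews: PreviewProvider {
    static var previews: some View {
        ForEach(Array(PreviewLayout.widgetSizes.enumerated()), id: \.offset) { _, layout in
            MetricView()
                .previewLayout(layout)
        }
    }
}
```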

      Acquiring Analytics

Largest Shopify Insights widget in light mode

Ever since our first widgets were developed, we’ve wanted to understand how merchants use the functionality so we can keep improving it. The data team built dashboards showing details like how many widgets are installed and which metrics and sizes are most popular.

Most of the data used to build the dashboards comes from analytics events fired by the widgets and the Shopify app.

      For the new widgets, we wanted to better understand adoption and retention of widgets, so we needed to capture how users are configuring their widgets over time and which ones are being added or removed.

      Managing Unique Ids

WidgetKit has the WidgetCenter struct, which allows requesting information about the widgets currently configured on the device through the getCurrentConfigurations method. However, the metadata it returns (WidgetInfo) doesn’t have a stable unique identifier: the identifier is the object itself, since it’s hashable. Given this constraint, if two identical widgets are added, they’ll both have the same identifier. Also, since the intent configuration is part of the id, changing something (for example, the date range) makes it look like a totally different widget.

      Given this limitation, we had to adjust the way we calculate the number of unique widgets. It also made it harder to distinguish between different life-cycle events (adding, removing, and configuring). Hopefully there will be a way to get unique ids for widgets in future versions of iOS. For now we created a single value derived from the most important parts of the widget configuration.
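As an illustration, such a derived value might combine the configuration parts like this (the type and field names are hypothetical, not Shopify’s actual implementation):

```swift
// Hypothetical sketch: deriving a single analytics identifier from the
// most important parts of a widget's configuration.
struct WidgetConfigIdentity {
    let kind: String       // the widget kind, e.g. "ShopifyInsights"
    let family: String     // small, medium, or large
    let metric: String     // the metric shown, e.g. "total-sales"
    let dateRange: String  // the configured date range, e.g. "last-7-days"

    // A single derived value. Two identical configurations still collide,
    // matching the limitation described above.
    var derivedID: String {
        [kind, family, metric, dateRange].joined(separator: ":")
    }
}
```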

      Detecting, Adding, and Removing Widgets 

Currently there’s no WidgetKit life cycle method that tells us when a widget is added, configured, or removed. We needed this information to better understand how widgets are being used.

After some exploration, we noticed that the only methods we could count on were getTimeline and getSnapshot. We decided to build something that could simulate the missing life cycle methods using the ones we had available. getSnapshot is usually called on state transitions and also in the widget gallery, so we discarded it as an option.

We built a solution that does the following:

1. Every time the Timeline provider’s getTimeline is called, we call WidgetCenter’s getCurrentConfigurations to see which widgets are currently installed.
2. We then compare this list with a previous snapshot we persist on disk.
3. Based on this comparison, we infer which widgets were added and removed.
4. We then trigger the corresponding life cycle methods: didAddWidgets(), didRemoveWidgets().

      Due to identifiers not being stable, we couldn’t find a reliable approach to detect configuration changes, so we ended up not supporting it.

We also noticed that the results of WidgetCenter’s getCurrentConfigurations can lag: if we remove a widget, it may take a couple of getTimeline calls for the removal to be reflected. We adjusted our analytics scheme to take that into account.
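The snapshot-diffing steps above can be sketched roughly as follows (the persistence key and the life cycle functions are hypothetical names, not Shopify’s actual code):

```swift
import WidgetKit

// Hypothetical life cycle hooks fired from the diff below.
func didAddWidgets(_ ids: Set<String>) { /* fire analytics events */ }
func didRemoveWidgets(_ ids: Set<String>) { /* fire analytics events */ }

// Called from the Timeline provider's getTimeline. Compares the current
// widget list against a snapshot persisted in UserDefaults.
func detectWidgetLifecycleEvents() {
    WidgetCenter.shared.getCurrentConfigurations { result in
        guard case .success(let widgets) = result else { return }

        // Derive an identifier per widget (not stable across identical
        // configurations, per the limitation discussed above).
        let currentIDs = Set(widgets.map { "\($0.kind):\($0.family)" })

        let defaults = UserDefaults.standard
        let previousIDs = Set(defaults.stringArray(forKey: "widgetSnapshot") ?? [])

        let added = currentIDs.subtracting(previousIDs)
        let removed = previousIDs.subtracting(currentIDs)
        if !added.isEmpty { didAddWidgets(added) }
        if !removed.isEmpty { didRemoveWidgets(removed) }

        defaults.set(Array(currentIDs), forKey: "widgetSnapshot")
    }
}
```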

Current solution for detecting added and removed widgets

      Supporting iOS 16

      Our approach to widgets made supporting iOS 16 out of the gate really simple with a few changes. Since our lock screen complications will surface the same information as our home screen widgets, we can actually reuse the Intent configuration, Timeline Provider, and most of the views! The only change we need to make is to adjust the supported families to include .accessoryInline, .accessoryCircular, and .accessoryRectangular, and, of course, draw those views.
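A sketch of what that change might look like (the kind, intent, provider, and view names here are hypothetical):

```swift
import SwiftUI
import WidgetKit

// Sketch: extending the supported families to add the iOS 16 lock screen
// families alongside the existing home screen ones.
struct InsightsWidget: Widget {
    var body: some WidgetConfiguration {
        IntentConfiguration(
            kind: "ShopifyInsights",
            intent: ViewMetricsIntent.self,
            provider: MetricsTimelineProvider()
        ) { entry in
            MetricsWidgetView(entry: entry)
        }
        .configurationDisplayName("Shopify Insights")
        .supportedFamilies([
            .systemSmall, .systemMedium, .systemLarge,
            // New for the iOS 16 lock screen:
            .accessoryInline, .accessoryCircular, .accessoryRectangular,
        ])
    }
}
```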

      Our Main View would also just need a slight adjustment to work with our existing home screen widgets.

      Migrating Gracefully

With the iOS 16 release, WidgetKit was also introduced for watchOS complications. This update comes with a foreboding message from Apple:

      As soon as you offer a widget-based complication, the system stops calling ClockKit APIs. For example, it no longer calls your CLKComplicationDataSource object’s methods to request timeline entries. The system may still wake your data source for migration requests.

We really care about our apps at Shopify, so we needed to unpack what this meant and how it would affect merchants running older devices. With some testing on devices, we found that everything is fine.

      If you’re currently running WidgetKit complications and add support for lock screen complications, your ClockKit app and complications will continue to function as you’d expect.

We had assumed that WidgetKit itself was taking the place of watchOS complications; however, to use WidgetKit on watchOS, you need to create a new target for the Watch. This makes sense, although the APIs are so similar that we had assumed it was a one-and-done approach: one WidgetKit extension for both platforms.

One thing to watch out for: if you do implement the new WidgetKit on watchOS, users on watchOS 9 and above will lose all of their ClockKit complications. Apple does provide a migration API to support the change, which is called instead of your old complications.

If you don’t have the luxury of simply setting your deployment target to iOS 16, your ClockKit complications will, from our testing, continue to load for users on watchOS 8 and below.

      Shipping New Widgets

We already had a set of widgets running on both platforms; now we had to decide how to transition to the new update, since it would replace the existing implementation. On iOS we had two different widget kinds, each with its own small widget (you can think of kinds as widget groups). With the new implementation, we wanted to provide a single widget kind that offered all three sizes. We didn’t find much documentation around the migration, so we simulated what happens to the widgets under different scenarios.

      If the merchant has a widget on their home screen and the app updates, one of two things would happen:

1. The widget became a blank white square (when the kind IDs matched).
2. The widget disappeared altogether (when the kind ID was changed).

Our initial plan was to make one of our widgets transform into the new widget while removing the other one. Given the above, this strategy wouldn’t work. It would also have carried some annoying tech debt, since all of our Intent files would continue to mention the name of the old widget.

The compromise we came to, so as not to remove all of a merchant’s widgets overnight, was to deprecate the old widgets at the same time we released the new ones. To deprecate, we updated the old widgets’ UI to display a message informing merchants that the widget is no longer supported and that they must add the new ones. The lesson here is to be careful when making decisions around widget grouping, as it’s not easy to change.

      There’s no way to add a new widget programmatically or to bring the merchant to the widget gallery by tapping on the old widget. We also added some communication to help ease the transition by:

        • updating our help center docs, including information around how to use widgets 
        • pointing our old widgets to open the help center docs 
        • giving lots of time before removing the deprecation message.

In the end, it wasn’t ideal, but we came away having learned about the pitfalls within the two ecosystems. One piece of advice is to reflect carefully on current and future needs when defining which widgets to offer and how to split them, since a future modification may not be straightforward.

Widget deprecation message: “This widget is no longer active. Add the new Shopify Insights widget for an improved view of your data. Learn more.”

        What’s Next

As we continue to learn about how merchants use our new generation of widgets, we’ll keep homing in on the best experience across both our platforms. Our widgets were built to be flexible, and we’ll be able to continually grow the list of metrics we offer through our customization. This work opens the way for other teams at Shopify to build on what we’ve started and create more widgets to help merchants too.

2022 is a busy year with iOS 16 coming out. We’ve got a new WidgetKit experience to integrate into our watch complications, lock screen complications, and, hopefully later this year, Live Activities!

        Carlos Pereira is a Senior Developer based in Montreal, Canada, with more than a decade of experience building native iOS applications in Objective-C, Swift and now React Native. Currently contributing to the Shopify admin app.

        James Lockhart is a Staff Developer based in Ottawa, Canada. Experiencing mobile development over the past 10+ years: from Phonegap to native iOS/Android and now React native. He is an avid baker and cook when not contributing to the Shopify admin app.

        Cecilia Hunka is a developer on the Core Admin Experience team at Shopify. Based in Toronto, Canada, she loves live music and jigsaw puzzles.

        Wherever you are, your next journey starts here! If building systems from the ground up to solve real-world problems interests you, our Engineering blog has stories about other challenges we have encountered. Intrigued? Visit our Engineering career page to find out about our open positions and learn about Digital by Design.


        What is a Full Stack Data Scientist?

        At Shopify, we've embraced the idea of full stack data science and are often asked, "What does it mean to be a full stack data scientist?". The term has seen a recent surge in the data industry, but there doesn’t seem to be a consensus on a definition. So, we chatted with a few Shopify data scientists to share our definition and experience.

        What is a Full Stack Data Scientist?

        "Full stack data scientists engage in all stages of the data science lifecycle. While you obviously can’t be a master of everything, full stack data scientists deliver high-impact, relatively quickly because they’re connected to each step in the process and design of what they’re building." - Siphu Langeni, Data Scientist

        "Full stack data science can be summed up by one word—ownership. As a data scientist you own a project end-to-end. You don't need to be an expert in every method, but you need to be familiar with what’s out there. This helps you identify what’s the best solution for what you’re solving for." - Yizhar (Izzy) Toren, Senior Data Scientist

        Typically, data science teams are organized to have different data scientists work on singular aspects of a data science project. However, a full stack data scientist’s scope covers a data science project from end-to-end, including:

        • Discovery and analysis: How you collect, study, and interpret data from a number of different sources. This stage includes identifying business problems.
        • Acquisition: Moving data from diverse sources into your data warehouse.
        • Data modeling: The process for transforming data using batch, streaming, and machine learning tools.

        What Skills Make a Successful Full Stack Data Scientist? 

        "Typically the problems you're solving for, you’re understanding them as you're solving them. That’s why you need to be constantly communicating with your stakeholders and asking questions. You also need good engineering practices. Not only are you responsible for identifying a solution, you also need to build the pipeline to ship that solution into production." - Yizhar (Izzy) Toren, Senior Data Scientist

        "The most effective full stack data scientists don't just wait for ad hoc requests. Instead, they proactively propose solutions to business problems using data. To effectively do this, you need to get comfortable with detailed product analytics and developing an understanding of how your solution will be delivered to your users." - Sebastian Perez Saaibi, Senior Data Science Manager

        Full stack data scientists are generalists versus specialists. As full stack data scientists own projects from end-to-end, they work with multiple stakeholders and teams, developing a wide range of both technical and business skills, including:

        • Business acumen: Full stack data scientists need to be able to identify business problems, and then ask the right questions in order to build the right solution.
        • Communication: Good communication—or data storytelling—is a crucial skill for a full stack data scientist who typically helps influence decisions. You need to be able to effectively communicate your findings in a way that your stakeholders will understand and implement. 
        • Programming: Efficient programming skills in a language like Python and SQL are essential for shipping your code to production.
        • Data analysis and exploration: Exploratory data analysis skills are a critical tool for every full stack data scientist, and the results help answer important business questions.
        • Machine learning: Machine learning is one of many tools a full stack data scientist can use to answer a business question or solve a problem, though it shouldn’t be the default. At Shopify, we’re proponents of starting simple, then iterating with complexity.  

        What’s the Benefit of Being a Full Stack Data Scientist? 

“You get to choose how you want to solve different problems. We don't have one way of doing things because it really depends on what the problem you’re solving for is. This can even include deciding which tooling to use.” - Yizhar (Izzy) Toren, Senior Data Scientist

        “You get maximum exposure to various parts of the tech stack, develop a confidence in collaborating with other crafts, and become astute in driving decision-making through actionable insights.” - Siphu Langeni, Data Scientist

        As a generalist, is a full stack data scientist a “master of none”? While full stack data scientists are expected to have a breadth of experience across the data science specialty, each will also bring additional expertise in a specific area. At Shopify, we encourage T-shaped development. Emphasizing this type of development not only enables our data scientists to hone skills they excel at, but it also empowers us to work broadly as a team, leveraging the depth of individuals to solve complex challenges that require multiple skill sets. 

        What Tips Do You Have for Someone Looking to Become a Full Stack Data Scientist? 

        “Full stack data science might be intimidating, especially for folks coming from academic backgrounds. If you've spent a career researching and focusing on building probabilistic programming models, you might be hesitant to go to different parts of the stack. My advice to folks taking the leap is to treat it as a new problem domain. You've already mastered one (or multiple) specialized skills, so look at embracing the breadth of full stack data science as a challenge in itself.” - Sebastian Perez Saaibi, Senior Data Science Manager

        “Ask lots of questions and invest effort into gathering context that could save you time on the backend. And commit to honing your technical skills; you gain trust in others when you know your stuff!” - Siphu Langeni, Data Scientist

        To sum it up, a full stack data scientist is a data scientist who:

        • Focuses on solving business problems
        • Is an owner that’s invested in an end-to-end solution, from identifying the business problem to shipping the solution to production
        • Develops a breadth of skills that cover the full stack of data science, while building out T-shaped skills
        • Knows which tool and technique to use, and when

        If you’re interested in tackling challenges as a full stack data scientist, check out Shopify’s career page.


        Managing React Form State Using the React-Form Library

One of Shopify’s philosophies when it comes to adopting a new technology isn’t only to level up the proficiency of our developers so they can implement it at scale, but also to share their newfound knowledge and understanding of the tech with the developer community.

In Part 1 (Building a Form with Polaris) of this series, we were introduced to Shopify’s Polaris Design System, an open source library used to develop the UI within our Admin. Here in Part 2, we’ll delve further into Shopify’s open source Quilt repo, which contains 72 npm packages, one of which is the react-form library. Each package was created to facilitate the adoption and standardization of React, and each has its own README and thorough documentation to help get you started.

        The react-form Library

        If we take a look at the react-form library repo we can see that it’s used to:

        “Manage React forms tersely and safely-typed with no effort using React hooks. Build up your form logic by combining hooks yourself, or take advantage of the smart defaults provided by the powerful useForm hook.”

        The useForm and useField Hooks

        The documentation categorizes the API into three main sections: Hooks, Validation, and Utilities. There are eight hooks in total and for this tutorial we’ll focus our attention on just the ones most frequently used: useForm and useField.

        useForm is a custom hook for managing the state of an entire form and makes use of many of the other hooks in the API. Once instantiated, it returns an object with all of the fields you need to manage a form. When combined with useField, it allows you to easily build forms with smart defaults for common use cases. useField is a custom hook for handling the state and validations of an input field.

        The Starter CodeSandbox

        As this tutorial is meant to be a step-by-step guide providing all relevant code snippets along the way, we highly encourage you to fork this Starter CodeSandbox so you can code along throughout the tutorial.

        If you hit any roadblocks along the way, or just prefer to jump right into the solution code, here’s the Solution CodeSandbox.

        First Things First—Clean Up Old Code

        The useForm hook creates and manages the state of the entire form which means we no longer need to import or write a single line of useState in our component, nor do we need any of the previous handler functions used to update the input values. We still need to manage the onSubmit event as the form needs instructions as to where to send the captured input but the handler itself is imported from the useForm hook.

        With that in mind let’s remove all the following previous state and handler logic from our form.

        React isn’t very happy at the moment and presents us with the following error regarding the handleTitleChange function not being defined.

        handleTitleChange is not defined

        This occurs because both TextField Components are still referencing their corresponding handler functions that no longer exist. For the time being, we’ll remove both onChange events along with the value prop for both Components.

        Although we’re removing them at the moment, they’re still required as per our form logic and will be replaced by the fields object provided by useForm.

        React still isn’t happy and presents us with the following error in regards to the Page Component assigning onAction to the handleSubmit function that’s been removed.

        handleSubmit is not defined

        It just so happens that the useForm hook provides a submit function that does the exact same thing, which we’ll destructure in the next section. For the time being we’ll assign submit to onAction and place it in quotes so that it doesn’t throw an error.

        One last and final act of cleanup is to remove the import for useState, at the top of the file, as we’ll no longer manage state directly.

        Our codebase now looks like the following:

        Importing and Using the useForm and useField Hooks

        Now that our Form has been cleaned up, let’s go ahead and import both the useForm and useField hooks from react-form. Note, for this tutorial the shopify/react-form library has already been installed as a dependency.

import { useForm, useField } from "@shopify/react-form";

        If we take a look at the first example of useForm in the documentation, we can see that useForm provides us quite a bit of functionality in a small package. This includes several properties and methods that can be instantiated, in addition to accepting a configuration object that’s used to define the form fields and an onSubmit function.

To keep ourselves focused on the basic functionality, we’ll start by capturing only the inputs for our two fields, title and description, and then handle the submission of the form. We’ll also pass in the configuration object, assigning useField() to each field and, lastly, an onSubmit function.
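As a rough sketch (not the tutorial’s exact code), that configuration might look like this; the onSubmit body here is illustrative:

```javascript
const { fields, submit } = useForm({
  fields: {
    title: useField(""),
    description: useField(""),
  },
  async onSubmit(fieldValues) {
    // Illustrative: log the captured values, then report success.
    console.log("Submitting product:", fieldValues);
    return { status: "success" };
  },
});
```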

Since we previously removed the value and onChange props from our TextField components, the inputs no longer capture or display text. The two props worked in conjunction: onChange updated state, allowing value to display the captured input once the component re-rendered. The same functionality is still required, but those props are now found in the fields object, which we can easily confirm by adding a console.log and viewing the output:

        If we do a bit more investigation and expand the description key, we see all of its additional properties and methods, two of which are onChange and value.

        With that in mind, let’s add the following to our TextField components:

        It’s clear from the code we just added that we’re destructuring the fields object and assigning the key that corresponds to the input’s label field. We should also be able to type into the inputs and see the text updated. The field object also contains additional properties and methods such as reset and dirty that we’ll make use of later when we connect our submit function.

        Submitting the Form

With our TextField components all set up, it’s time to enable the form to be submitted. As part of the previous cleanup process, we updated the Page Component’s onAction prop, and now it’s time to remove the quotes.

        Now that we’ve enabled submission of the form, let’s confirm that the onSubmit function works and take a peek at the fields object by adding a console log.

        Let’s add a title and description to our new product and click Save.

Add Product screen with Title and Description fields

        We see the following output:

        More Than Just Submitting

        When we reviewed the useForm documentation earlier we made note of all the additional functionality that it provides, two of which we will make use of now: reset and dirty.

        Reset the Form After Submission

reset is a method used to clear the form, providing the user with a clean slate to add additional products once the previous one has been saved. reset should be called only after the fields have been passed to the backend and the data has been handled appropriately, but before the return statement.

        If you input some text and click Save, our form should clear the input fields as expected.

        Conditionally Enable The Save Button

dirty is used to disable the Save button until the user has typed some text into either of the input fields. The Page component manages the Save button and has a disabled property, to which we assign !dirty: dirty starts out false, so disabled is initially true and the button stays disabled until a field changes.
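Wired into the Page component, that might look like the following sketch (the surrounding markup is abbreviated, and the title text is illustrative):

```javascript
<Page
  title="Add Product"
  primaryAction={{
    content: "Save",
    onAction: submit,
    disabled: !dirty, // stays disabled until a field changes
  }}
>
  {/* TextField components go here */}
</Page>
```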

You should now notice that the Save button is disabled until you type into either of the fields, at which point Save is enabled.

        We can also validate that it’s now disabled by examining the Save button in developer tools.

Developer tools showing the disabled Save button

        Form Validation

What we might have noticed when adding dirty is that the Save button is enabled as soon as the user types into either field. One last requirement for our form is that the Title field must contain some input before the product can be submitted. To do this we’ll import the notEmpty validator from react-form.

Using it also requires that we now pass useField a configuration object with two keys: value and validates. The value key keeps track of the current input value, and validates provides a means of validating input based on some criteria.

        In our case, we’ll prevent the form from being submitted if the title field is empty and provide the user an error message indicating that it’s a required field.
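Putting this together, a sketch of the title field configured with notEmpty might look like the following (again, not the tutorial’s exact code):

```javascript
import { useForm, useField, notEmpty } from "@shopify/react-form";

const { fields, submit } = useForm({
  fields: {
    // title must be non-empty before the form can be submitted.
    title: useField({
      value: "",
      validates: [notEmpty("Field is required")],
    }),
    description: useField(""),
  },
  async onSubmit(fieldValues) {
    return { status: "success" };
  },
});
```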

        Let’s give it a test run and confirm it’s working as expected. Try adding only a description and then click Save.

Empty Title field showing the “Field is required” error message

        As we can see our form implements all the previous functionality and then some, all of which was done via the useForm and useField hooks. There’s quite a bit more functionality that these specific hooks provide, so I encourage you to take a deeper dive and explore them further.

This tutorial was meant to introduce you to Shopify’s open source react-form library that’s available in Shopify’s public Quilt repo. The repo provides many more useful React packages, such as the following, to name a few:

        • react-graphql: for creating type-safe and asynchronous GraphQL components for React
        • react-testing:  for testing React components according to our conventions
        • react-i18n: which includes i18n utilities for handling translations and formatting.

        Joe Keohan is a Front End Developer at Shopify, located in Staten Island, NY and working on the Self Serve Hub team enhancing the buyer’s experience. He’s been instructing software engineering bootcamps for the past 6 years and enjoys teaching others about software development. When he’s not working through an algorithm, you’ll find him jogging, surfing and spending quality time with his family.



        Leveraging Go Worker Pools to Scale Server-side Data Sharing

Maintaining a service that processes over a billion events per day presents many learning opportunities. In an ecommerce landscape, our service needs to be ready to handle global traffic, flash sales, holiday seasons, failovers, and the like. In this post, I’ll detail how I scaled our Server Pixels service to increase its event processing performance by 170%, from 7.75 thousand events per second per pod to 21 thousand events per second per pod.

        But First, What is a Server Pixel?

Capturing a Shopify customer’s journey is critical to contributing insights into the marketing efforts of Shopify merchants. Similar to Web Pixels, Server Pixels grants Shopify merchants the freedom to activate and understand their customers’ behavioral data by sending this structured data from storefronts to marketing partners. However, unlike the Web Pixels service, Server Pixels sends these events through the server rather than the client. This server-side data sharing has proven to be more reliable, allowing for better control and observability of the data going out to our partners’ servers. Merchants benefit from this as they’re able to drive more sales at a lower customer acquisition cost (CAC). With regional opt-out privacy regulations built into our service, only customers who have allowed tracking have their events processed and sent to partners. Key events in a customer’s journey on a storefront are captured, such as checkout completion, search submissions, and product views. Server Pixels is a service written in Golang that validates, processes, augments, and, consequently, produces more than one billion customer events per day. However, with such a large number of events to manage, problems of scale start to emerge.

        The Problem

Server Pixels leverages Kafka infrastructure to manage the consumption and production of events. Our scale problem began when an increase in customer events triggered a growing consumption lag on our Kafka input topic. Our service was susceptible to falling behind on events if any downstream component slowed down. As shown in the diagram below, our downstream components process (parse, validate, and augment) and produce events in batches:

Old design: Storefronts → Kafka Consumer → Batch Processor → Unlimited Goroutines → Kafka Producer → Third-Party Partners

        The problem with our original design was that an unlimited number of threads would get spawned when batch events needed to be processed or produced. So when our service received an increase in events, an unsustainable number of goroutines were generated and ran concurrently.

Goroutines can be thought of as lightweight threads: functions or methods that run concurrently with other functions and threads. In a service, spawning an unlimited number of goroutines to execute a growing queue of tasks is never ideal. The machine executing these tasks will keep expending its resources, like CPU and memory, until it reaches its limit. Furthermore, our team has a service level objective (SLO) of five minutes for event processing, so any delay risked pushing processing past that deadline. In anticipation of three times the usual load for BFCM, we needed our service to work smarter, not harder.

        Our solution? Go worker pools.

        The Solution

The Go worker pool pattern

        The worker pool pattern is a design in which a fixed number of workers are given a stream of tasks to process in a queue. The tasks stay in the queue until a worker is free to pick up the task and execute it. Worker pools are great for controlling the concurrent execution for a set of defined jobs. As a result of these workers controlling the amount of concurrent goroutines in action, less stress is put on our system’s resources. This design also worked perfectly for scaling up in anticipation of BFCM without relying entirely on vertical or horizontal scaling.

When tasked with this new design, I was surprised at how intuitive worker pools are to set up. The premise is creating a Go channel that receives a stream of jobs. You can think of Go channels as pipes that connect concurrent goroutines, allowing them to communicate with each other: you send values into a channel from one goroutine and receive those values in another. The workers retrieve jobs from the channel as they become available, provided the worker isn’t busy processing another job. Concurrently, the results of these jobs are sent to another Go channel or to another part of the pipeline.
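As a minimal, self-contained illustration of that channel behaviour (not the Server Pixels code itself):

```go
package main

import "fmt"

// sendResults demonstrates the channel behaviour described above: one
// goroutine sends values into a channel, another receives them.
func sendResults() []string {
	messages := make(chan string)

	// Sender goroutine: pushes two values, then closes the channel.
	go func() {
		messages <- "first result"
		messages <- "second result"
		close(messages)
	}()

	// Receiver: ranges over the channel until it's closed.
	var received []string
	for msg := range messages {
		received = append(received, msg)
	}
	return received
}

func main() {
	fmt.Println(sendResults())
}
```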

        So let me take you through the logistics of the code!

        The Code

        I defined a Worker interface that requires a CompleteJobs function, which takes a Go channel of type Job.

        The Job type takes the event batch that’s integral to completing the task as a parameter. Other types, like NewProcessorJob, can embed this struct to fit different use cases of the specific task.

        New workers are created using the function NewWorker. It takes a workFunc parameter, which processes the jobs. This workFunc can be tailored to any use case, so we can use the same Worker interface for different components to do different types of work. The core of what makes this design powerful is that the Worker interface is shared amongst different components to do different types of tasks based on the Job spec.

        CompleteJobs will call workFunc on each Job as it receives it from the jobs channel. 

        Now let’s tie it all together.
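        The sketch below wires the pieces together under our own assumptions; the names numWorkers, jobs, and producerWg follow the descriptions in this post, while the processing itself is a stand-in that just counts events:

```go
package main

import (
	"fmt"
	"sync"
)

// Job carries an event batch; the field name is an illustrative assumption.
type Job struct {
	EventBatch []string
}

// completeJobs processes each Job from the jobs channel and emits a result,
// signalling the WaitGroup once the channel is closed and drained.
func completeJobs(jobs <-chan Job, wg *sync.WaitGroup, results chan<- int) {
	defer wg.Done()
	for job := range jobs {
		results <- len(job.EventBatch) // stand-in for real processing
	}
}

// runPool pushes every batch through a pool of numWorkers workers and returns
// the total number of events processed.
func runPool(batches [][]string, numWorkers int) int {
	jobs := make(chan Job)
	results := make(chan int)
	var producerWg sync.WaitGroup

	// Start a fixed number of workers, each receiving from the jobs channel.
	for i := 0; i < numWorkers; i++ {
		producerWg.Add(1)
		go completeJobs(jobs, &producerWg, results)
	}

	// Close the results channel only after every worker has finished.
	go func() {
		producerWg.Wait()
		close(results)
	}()

	// Emit each batch to the jobs channel, then close it so the workers'
	// range loops terminate.
	go func() {
		for _, batch := range batches {
			jobs <- Job{EventBatch: batch}
		}
		close(jobs)
	}()

	total := 0
	for n := range results {
		total += n
	}
	return total
}

func main() {
	batches := [][]string{{"checkout_started"}, {"page_viewed", "cart_updated"}}
	fmt.Println(runPool(batches, 3)) // prints 3
}
```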

        Above is an example of how I used workers to process the events in our pipeline. A jobs channel and a set number of workers, numWorkers, are initialized. The workers are then poised to receive from the jobs channel in the CompleteJobs function in a goroutine. Putting go before the CompleteJobs call allows the function to run in a goroutine!

        As event batches get consumed in the for loop above, each batch is converted into a Job that’s emitted to the jobs channel with the <- operator. Each worker receives these jobs accordingly from the jobs channel. The goroutine we previously called with go worker.CompleteJobs(jobs, &producerWg) runs concurrently and receives these jobs.

        But wait, how do the workers know when to stop processing events?

        When the system is ready to be scaled down, wait groups are used to ensure that any tasks in flight are completed before the system shuts down. A WaitGroup is a counter in Go whose Wait method blocks execution until the counter reaches zero. As the workers were created above, the WaitGroup counter was incremented with producerWg.Add(1) for every worker created. In the CompleteJobs function, wg.Done() is executed when the jobs channel is closed and jobs stop being received; wg.Done() decrements the WaitGroup counter for every worker.

        When a context cancel signal is received (signified by <-ctx.Done() above), the remaining batches are sent to the Job channel so the workers can finish their execution. The Job channel is then closed safely, enabling the workers to break out of the loop in CompleteJobs and stop processing jobs. At this point, the WaitGroup counters are zero and the outputBatches channel, where the results of the jobs are sent, can be closed safely.

        The Improvements

        A flow diagram showing the new design flow from Storefronts to Kafka consumer to batch processor to Go worker pools to Kafka producer to Third party partners.
        New design

        Once deployed, the time improvement using the new worker pool design was promising. Load testing showed that as more workers were added, more events could be processed on one pod. As mentioned before, our previous implementation could only handle around 7.75 thousand events per second per pod in production without adding to our consumption lag.

        My team initially set the number of workers to 15 each in the processor and producer. This introduced a processing lift of 66% (12.9 thousand events per second per pod). By upping the workers to 50, we increased our event load by 149% over the old design, resulting in 19.3 thousand events per second per pod. Currently, with performance improvements, we can do 21 thousand events per second per pod: a 170% increase! This was a great win for the team and gave us the foundation to be adequately prepared for BFCM 2021, where we experienced a max of 46 thousand events per second!

        Go worker pools are a lightweight solution to speed up computation and make concurrent tasks more performant. This Go worker pool design has been reused to work with other components of our service, such as validation, parsing, and augmentation.

        By using the same Worker interface for different components, we can scale out each part of our pipeline differently to meet its use case and expected load. 

        Cheers to more quiet BFCMs!

        Kyra Stephen is a backend software developer on the Event Streaming - Customer Behavior team at Shopify. She is passionate about working on technology that impacts users and makes their lives easier. She also loves playing tennis, obsessing about music and being in the sun as much as possible.

        Wherever you are, your next journey starts here! If building systems from the ground up to solve real-world problems interests you, our Engineering blog has stories about other challenges we have encountered. Intrigued? Visit our Engineering career page to find out about our open positions and learn about Digital by Design.

        Shopify Data’s Guide To Opportunity Sizing

        For every initiative that a business takes on, there is an opportunity potential and a cost—the cost of not doing something else. But how do you tangibly determine the size of an opportunity?

        Opportunity sizing is a method that data scientists can use to quantify the potential impact of an initiative ahead of making the decision to invest in it. Although businesses attempt to prioritize initiatives, they rarely do the math to assess the opportunity, relying instead on intuition-driven decision making. While this type of decision making does have its place in business, it also runs the risk of being easily swayed by a number of subtle biases, such as information available, confirmation bias, or our intrinsic desire to pattern-match a new decision to our prior experience.

        At Shopify, our data scientists use opportunity sizing to help our product and business leaders make sure that we’re investing our efforts in the most impactful initiatives. This method enables us to be intentional when checking and discussing the assumptions we have about where we can invest our efforts.

        Here’s how we think about opportunity sizing.

        How to Opportunity Size

        Opportunity sizing is more than just a tool for numerical reasoning, it’s a framework businesses can use to have a principled conversation about the impact of their efforts.

        An example of opportunity sizing could look like the following equation: if we build feature X, we will acquire MM (+/- delta) new active users in T timeframe under DD assumptions.

        So how do we calculate this equation? First things first: although the timeframe for opportunity sizing can be anything relevant to your initiative, we recommend an annualized view of the impact so you can easily compare across initiatives. This is important because the point in the year when your initiative goes live can significantly affect its estimated in-year impact.

        Diving deeper into how to size an opportunity, below are a few methods we recommend for various scenarios.

        Directional T-Shirt Sizing

        Directional t-shirt sizing is the most common approach when opportunity sizing an existing initiative and is a method anyone (not just data scientists) can do with a bit of data to inform their intuition. This method is based on rough estimates and depends on subject matter experts to help estimate the opportunity size based on similar experiences they’ve observed in the past and numbers derived from industry standards. The estimates used in this method rely on knowing your product or service and your domain (for example, marketing, fulfillment, etc.). Usually the assumptions are generalized, assuming overall conversion rates using averages or medians, and not specific to the initiative at hand.

        For example, let’s say your Growth Marketing team is trying to update an email sequence (an email to your users about a new product or feature) and is looking to assess the size of the opportunity. Using the directional t-shirt sizing method, you can use the following data to inform your equation:

        1. The open rates of your top-performing content
        2. The industry average of open rates 

        Say your top-performing content has an open rate of five percent, while the industry average is ten percent. Based on these benchmarks, you can assume that the opportunity can be doubled (from five to ten percent).

        This method offers speed over accuracy, so there is a risk of embedded biases and lack of thorough reflection on the assumptions made. Directional t-shirt sizing should only be used in the stages of early ideation or sanity checking. Opportunity sizing for growth initiatives should use the next method: bottom-up.

        A matrix diagram with high rigor, lower rigor, new initiatives and existing initiatives as categories. It highlights that Directional T-Shirt sizing requires lower rigor opportunities
        Directional t-shirt sizing should be used for existing initiatives that require lower rigor opportunities.

        Bottom-Up Using Comparables

        Unlike directional t-shirt sizing, the bottom-up method uses the performance of a specific comparable product or system as a benchmark, and relies on the specific skills of a data scientist to make data-informed decisions. The bottom-up method is used to determine the opportunity of an existing initiative. The bottom-up method relies on observed data on similar systems, which means it tends to have a higher accuracy than directional t-shirt sizing. Here are some tips for using the bottom-up method:

        1. Understand the performance of a product or system that is comparable.

        To introduce any enhancements to your current product or system, you need to understand how it’s performing in the first place. You’ll want to identify, observe and understand the performance rates in a comparable product or system, including the specifics of its unique audience and process.

        For example, let’s say your Growth Marketing team wants to localize a new welcome email to prospective users in Italy that will go out to 100,000 new leads per year. A comparable system could be a localized welcome email in France that the team sent out the prior year. With your comparable system identified, you’ll want to dig into some key questions and performance metrics like:

        • How many people received the email?
        • Is there anything unique about that audience selection? 
        • What is the participation rate of the email? 
        • What is the conversion rate of the sequence? Or in other words, of those that opened your welcome email, how many converted to customers?

        Let’s say we identified that our current non-localized email in Italy has a click through rate (CTR) of three percent, while our localized email in France has a CTR of five percent over one year. Based on the metrics of your comparable system, you can identify a base metric and make assumptions of how your new initiative will perform.

        2. Be clear and document your assumptions.

        As you think about your initiative, be clear and document your assumptions about its potential impact and the why behind each assumption. Using the performance metrics of your comparable system, you can generate an assumed base metric and the potential impact your initiative will have on that metric. With your base metric in hand, you’ll want to consider the positive and negative impacts your initiative may have, so quantify your estimate in ranges with an upper and lower bound.

        Returning to our localized welcome email example, based on the CTR metrics from our comparable system we can assume the impact of our Italy localization initiative: if we send out a localized welcome email to 100,000 new leads in Italy, we will obtain a CTR between three and five percent (+/- delta) in one year. This is based on our assumptions that localized content will perform better than non-localized content, as seen in the performance metrics of our localized welcome email in France.

        3. Identify the impact of your initiative on your business’ wider goals.

        Now that you have your opportunity sizing estimate for your initiative, the next question is: what does that mean for the rest of your business goals? To answer this, you’ll want to estimate the impact on your top-line metric. This enables you to compare different initiatives with an apples-to-apples lens, while also avoiding the tendency to bias toward larger numbers when making comparisons and assessing impact. For example, a one percent change in the number of sessions can look much bigger than a three percent change in the number of customers, which is further down the funnel.

        Returning to our localized welcome email example, we should ask ourselves how an increase in CTR impacts our topline metric of active user count. Let’s say that when we localized the email in France, we saw an increase of five percent in CTR that translated to a three percent increase in active users per year. Accordingly, if we localize the welcome email in Italy, we may expect a similar three percent increase, which would translate to 3,000 more active users per year.
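        As a quick sketch of that arithmetic (the figures are this example’s illustrative numbers, not real data, and the helper name is our own):

```go
package main

import "fmt"

// activeUserLift translates a funnel-level change into the topline metric:
// the number of new leads times the assumed percentage lift in active users.
// Using an integer percentage keeps the arithmetic exact.
func activeUserLift(leads, liftPercent int) int {
	return leads * liftPercent / 100
}

func main() {
	// 100,000 new leads per year, with an assumed 3% lift in active users.
	fmt.Println(activeUserLift(100000, 3)) // prints 3000
}
```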

        Second order thinking is a great asset here. It’s beneficial for you to consider potential modifiers and their impact. For instance, perhaps getting more people to click on our welcome email will reduce our funnel performance because we have lower intent people clicking through. Or perhaps it will improve funnel performance because people are better oriented to the offer. What are the ranges of potential impact? What evidence do we have to support these ranges? From this thinking, our proposal may change: we may not be able to just change our welcome email, we may also have to change landing pages, audience selection, or other upstream or downstream aspects.

        A matrix diagram with high rigor, lower rigor, new initiatives and existing initiatives as categories. It highlights that Bottom-up sizing is used for existing initiatives that requires higher rigor opportunities
        Bottom-up opportunity sizing should be used for existing initiatives that require higher rigor opportunities.


Top-Down

        The top-down method should be used when opportunity sizing a new initiative. This method is more nuanced, as you’re not optimizing something that already exists. With the top-down method, you’ll start from a larger set of vague information, which you’ll then attempt to narrow down into a more accurate estimation based on assumptions and observations.

        Here are a few tips on how to implement the top-down method:

        1. Gather information about your new initiative.

        Unlike the bottom-up method, you won’t have a comparable system to establish a base metric. Instead, seek as much information on your new initiative as you can from internal or external sources.

        For example, let’s say you’re looking to size the opportunity of expanding your product or service to a new market. In this case, you might want to get help from your product research team to gain more information on the size of the market, number of potential users in that market, competitors, etc.

        2. Be clear and document your assumptions.

        Just like the bottom-up method, you’ll want to clearly identify your estimates and what evidence you have to support them. For new initiatives, assumptions typically lean more optimistic than for existing initiatives because we’re biased to believe that our initiatives will have a positive impact. This means you need to be rigorous in testing your assumptions as part of this sizing process. Some ways to test your assumptions include:

        • Using the range of improvement of previous initiative launches to give you a sense of what's possible. 
        • Bringing the business case to senior stakeholders and arguing your case. This often forces you to think twice about your assumptions.

        You should be conservative in your initial estimates to account for this lack of precision in your understanding of the potential.

        Looking at our example of opportunity sizing a new market, we’ll want to document some assumptions about:

        • The size of the market: What is the size of the existing market versus the new market? You can gather this information from external datasets. In the absence of data on a market or audience, you can also make assumptions based on similar audiences or regions elsewhere.
        • The rate at which you think you can reach and engage this market: This includes the assumed conversion rates of new users. The conversion rates may be assumed to be similar to past performance when a new channel or audience was introduced. You can use the tips identified in the bottom-up method.

        3. Identify the impact of your initiative on your business’ wider goals.

        Like the bottom-up method, you need to assess the impact your initiative will have on your business’ wider goals. Based on the above example, what does the assumed impact of our initiative mean in terms of active users?

        A matrix diagram with high rigor, lower rigor, new initiatives and existing initiatives as categories. It highlights that Top-down sizing is for new initiatives that require higher rigor opportunities
        Top-down opportunity sizing should be used for new initiatives that require higher rigor opportunities.

        And there you have it! Opportunity sizing is a worthy investment that helps you say yes to the most impactful initiatives. It’s also a significant way for data science teams to help business leaders prioritize and steer decision-making. Once your initiative launches, test to see how close your sizing estimates were to actuals. This will help you hone your estimates over time.

        Next time your business is outlining its product roadmap, or your team is trying to decide whether it’s worth it to take on a particular project, use our opportunity sizing basics to help identify the potential opportunity (or lack thereof).

        Dr. Hilal is the VP of Data Science at Shopify, responsible for overseeing the data operations that power the company’s commercial and service lines.


        How We Built Oxygen: Hydrogen’s Counterpart for Hosting Custom Storefronts

        In June, we shared the details of how we built Hydrogen, our React-based framework for building custom storefronts. We talked about some of the big bets we made on new technologies like React Server Components, and the many internal and external collaborations that made Hydrogen a reality.

        This time we tell the story of Oxygen, Hydrogen’s counterpart that makes hosting Hydrogen custom storefronts easy and seamless. Oxygen guarantees fast and globally available storefronts that securely integrate with the Shopify ecosystem while eliminating additional costs of setting up third-party hosting tools. 

        We’ll dive into the experiences we focused on, the technical choices we made to build those experiences, and how those choices paved the path for Shopify to get involved in standardizing serverless runtimes in partnership with leaders like Cloudflare and Deno.

        Shopify-Integrated Merchant Experience

        Let’s first briefly look at why we built Oxygen. There are existing products in the market that can host Hydrogen custom storefronts. Oxygen’s uniqueness is in the tight integration it provides with Shopify. Our technical choices so far have largely been grounded in ensuring this integration is frictionless for the user.

        We started with GitHub for version control, GitHub Actions for continuous deployment, and Cloudflare for worker runtimes and edge distribution. We combined these third-party services with first-party services such as Shopify CDN, Shopify Admin API, and Shopify Identity and Access Management. They’re glued together by Oxygen-scoped services that additionally provide developer tooling and observability. Oxygen today is the result of bundling together this collection of technologies.

        A flow diagram highlighting the interaction between the developer, GitHub, Oxygen, Cloudflare, the buyer, and Shopify
        Oxygen overview

        We introduced the Hydrogen sales channel as the connector between Hydrogen, Oxygen, and the shop admin. The Hydrogen channel is the portal that provides controls to create and manage custom storefronts, link them to the rest of the shop’s administrative functions, and connect them to Oxygen for hosting. It is built on Shopify’s standard Rails and React stack, leveraging the Polaris design system for a consistent user experience across Shopify-built admin experiences.

        Fast Buyer Experience

        Oxygen exists to give merchants the confidence that Shopify will deliver an optimal buyer experience while merchants focus on their entrepreneurial objectives. Optimal buyer experience in Oxygen’s context is a combination of high availability guarantees, super fast site performance from anywhere in the world, and resilience to handle high-volume traffic.

        To Build or To Reuse

        This is where we had the largest opportunity to contemplate our technical direction. We could leverage over a decade’s experience at Shopify in building infrastructure solutions that keep the entire Shopify platform up and running to build an infrastructure layer, control plane, and proprietary V8 isolates. In fact, we did briefly and it was a successful venture! However, we ultimately decided to opt for Cloudflare’s battle-hardened worker infrastructure that guarantees buyer access to storefronts within milliseconds due to global edge distribution.

        This foundational decision significantly simplified upfront infrastructural complexity, scale and security risk considerations, allowing us to get Oxygen to merchants faster and validate our bets. These choices also leave us enough room to go back and build our own proprietary version at scale or a simpler variation of it if it makes sense both for the business and the users.

        A flow diagram highlighting the interactions between the Shopify store, Cloudflare, and the Oxygen workers.
        Oxygen workers overview

        We were able to provide merchants and buyers the promised fast performance while locking in access controls. When a buyer makes a request to a Hydrogen custom storefront, that request is received by Oxygen’s Gateway Worker running in Cloudflare. This worker is responsible for validating that the accessor has the necessary authorization for the shop and the specific storefront version before routing them to the Storefront Worker that runs the Hydrogen-based storefront code. The worker chaining is made possible using Cloudflare’s new Dynamic Dispatch API from Workers for Platforms.

        Partnerships and Open Source

        Rather than reinventing the wheel, we took the opportunity to work with leaders in the JavaScript runtimes space to collectively evolve the ecosystem. We use and contribute to the Workers for Platforms solution through tight feedback loops with Cloudflare. We also jointly established WinterCG, a JavaScript runtimes community group in partnership with Cloudflare, Vercel, Deno, and others. We leaned in to collectively building with and for the community, just the way we like it at Shopify.

        Familiar Developer Experience

        Oxygen largely provides developer-oriented capabilities and we strive to provide a developer experience that cohesively integrates with existing developer workflows. 

        Continuous Data-Informed Improvements

        While the Oxygen platform takes care of the infrastructure and distribution management of custom storefronts, it also surfaces critical information about how custom storefronts are performing in production, ensuring fast feedback loops throughout the development lifecycle. Specifically, runtime logs and metrics are surfaced through the observability dashboard within the Hydrogen channel for troubleshooting and performance trend monitoring. From these, developers can determine the actions needed to further improve site quality.

        A flow diagram highlighting the observability enablement side
        Oxygen observability overview

        We made very deliberate technical choices again on the observability enablement side. Unlike Shopify’s internal observability stack, Oxygen’s observability stack consists of Grafana for dashboards and alerting, Cortex for metrics, Loki for logging, and Tempo for tracing. Under the hood, Oxygen’s Trace Worker runs in Cloudflare and attaches itself to Storefront Workers to capture all of the logging and metrics information and forwards all of it to our Grafana stack. Logs are sent to Loki and metrics are sent to Cortex where the Oxygen Observability Service pulls both on-demand when the Hydrogen channel requests it.

        The tech stack was chosen for two key purposes: Oxygen provided a good test bed to experiment and evaluate these tools for a potential long-term fit for the rest of Shopify, and Oxygen’s use case is fundamentally different from internal Shopify. To support the latter, we needed a way to separate internal-facing from external-facing metrics cleanly while scaling to the data loads. We also needed the tool to be flexible enough that we can provide merchants optionality to integrate with any existing monitoring tools in their workflows.

        What’s Next

        Thanks to many flexible, eager, and collaborative merchants who maintained tight feedback loops every step of the way, Oxygen is used in production today by Allbirds, Shopify Supply, Shopify Hardware, and Denim Tears. It is generally available to our Plus merchants as of June.

        We’re just getting started though! We have our eyes on unlocking composable, plug-and-play styled usage in addition to surfacing deeper development insights earlier in the development lifecycle to shorten feedback loops. We also know there is a lot of opportunity for us to enhance the developer experience by reducing the number of surfaces to interact with, providing more control from the command line, and generally streamlining the Oxygen developer tools with the overall Shopify developer toolbox.

        We’re eager to take in all the merchant feedback as they demand the best of us. It helps us discover, learn, and push ourselves to revalidate assumptions, which will ultimately create new opportunities for the platform to evolve.

        Curious to learn more? We encourage you to check out the docs!

        Sneha is an engineering leader on the Oxygen team and has contributed to various teams building tools for a developer-oriented audience.

        How We Enable Two-Day Delivery in the Shopify Fulfillment Network

        Merchants sign up for Shopify Fulfillment Network (SFN) to save time on storing, packing, and shipping their merchandise, which is distributed and balanced across our network of Shopify-certified warehouses throughout North America. We optimize the placement of inventory so it’s closer to buyers, saving on shipping time, costs, and carbon emissions.

        Recently, SFN also made it easier and more affordable for merchants to offer two-day delivery across the United States. We consider many real-world factors to provide accurate delivery dates, optimizing for sustainable ground shipment methods and low cost.

        A Platform Solution

        As with most new features at Shopify, we couldn’t just build a custom solution for SFN. Shopify is an extensible platform with a rich ecosystem of apps that solve complex merchant problems. Shopify Fulfillment Network is just one of the many third-party logistics (3PL) solutions available to merchants. 

        This was a multi-team initiative. One team built the delivery date platform in the core of Shopify, consisting of a new set of GraphQL APIs that any 3PL can use to upload their delivery dates to the Shopify platform. Another team integrated the delivery dates into the Shopify storefront, where they’re shown to buyers on the product details page and at checkout. A third team built the system to calculate and upload SFN delivery dates to the core platform. SFN is an app that merchants install in their shops, in the same way other 3PLs interact with the Shopify platform. The SFN app calls the new delivery date APIs to upload its own delivery dates to Shopify. For accuracy, the SFN delivery dates are calculated using network, carrier, and product data. Let’s take a closer look at these inputs to the delivery date.

        Four Factors That Determine Delivery Date

        Many factors determine when a package will leave the warehouse and how long it will spend in transit to the destination address. Each 3PL has its own particular considerations, and the Shopify platform is flexible enough to support them all. Requiring 3PLs to conform to fine-grained platform primitives, such as operating days and capacity or processing and transit times, would only result in a loss of fidelity of their particular network and processes.

        With that in mind, we let 3PLs populate the platform with their pre-computed delivery dates that Shopify surfaces to buyers on the product details page and checkout. The 3PL has full control over all the factors that affect their delivery dates. Let’s take a look at some of these factors.

        1. Proximity to the Destination

        The time required for delivery is dependent on the distance the package must travel to arrive at its destination. Usually, the closer the inventory is to the destination, the faster the delivery. This means that SFN delivery dates depend on specific inventory availability throughout the network. 

        2. Heavy, Bulky, or Dangerous Goods

        Some carrier services aren’t applicable to merchandise that exceeds a specific weight or dimensions. Others can’t be used to transport hazardous materials. The delivery date we predict for such items must be based on a shipping carrier service that can meet the requirements.

        3. Time Spent at Fulfillment Centers

        Statutory holidays and warehouse closures affect when the package can be ready for shipment. Fulfillment centers have regular operating calendars and sometimes exceptional circumstances can force them to close, such as severe weather events.

        On any given operating day, warehouse staff need time to pick and pack items into packages for shipment. There’s also a physical limit to how many packages can be processed in a day. Again, exceptional circumstances such as illness can reduce the staff available at the fulfillment center, reducing capacity and increasing processing times.

        4. Time Spent in the Hands of Shipping Carriers

        In SFN, shipping carriers such as UPS and USPS are used to transport packages from the warehouse to their destination. Just like the warehouses, shipping carriers have their own holidays and closures that affect when the package can be picked up, transported, and delivered. These are modeled as carrier transit days, when packages are moved between hubs in the carrier’s own network, and delivery days, when packages are taken from the last hub to the final delivery address.

        Shipping carriers send trucks to pick up packages from the warehouse at scheduled times of the day. Orders that are made after the last pickup for the day have to wait until the next day to be shipped out. Some shipping carriers only pick up from the warehouse if there are enough packages to make it worth their while. Others impose a limit on the volume of packages they can take away from the warehouse, according to their truck dimensions. These capacity limits influence our choice of carrier for the package.

        Shipping carriers also publish the expected number of days it takes for them to transport a package from the warehouse to its destination (called Time in Transit). The first transit day is the day after pickup, and the last transit day is the delivery day. Some carriers deliver on Saturdays even though they won’t transport the package within their network on a Saturday.
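        A toy version of those transit rules can make them concrete. This sketch assumes a simplified calendar: orders after the daily pickup cutoff ship the next day, the first transit day is the day after pickup, Sundays are never transit or delivery days, and Saturdays count only as the final delivery day for carriers that deliver on Saturday. Real carrier calendars and pickup schedules are more involved.

```ruby
require "date"

# Estimate a delivery date from the order time, the pickup cutoff hour,
# and the carrier's published Time in Transit. (Simplified illustration.)
def estimated_delivery(order_time, cutoff_hour:, transit_days:, saturday_delivery: false)
  # Orders placed after the last pickup of the day ship the next day.
  pickup = order_time.hour < cutoff_hour ? order_time.to_date : order_time.to_date + 1
  date = pickup
  remaining = transit_days
  while remaining > 0
    date += 1
    next if date.sunday? # no transit or delivery on Sundays
    # Saturday counts only as a delivery day, and only for some carriers.
    next if date.saturday? && !(remaining == 1 && saturday_delivery)
    remaining -= 1
  end
  date
end
```

For an order placed Thursday at 3pm with a 4pm cutoff and two transit days, a Saturday-delivery carrier arrives Saturday; one that doesn't deliver Saturdays slips to Monday.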

        Putting It All Together

        Together, all of these factors are considered when we decide which fulfillment center should process a shipment and which shipping carrier service to use to transport it to its final destination. At regular intervals, we select a fulfillment center and carrier service that optimizes for sustainable two-day delivery for every merchant SKU to every US zip code. From this, we upload pre-calculated schedules of delivery dates to the Shopify platform.

        That’s a Lot of Data

        It’s a lot of data, but much of it can be shared between merchant SKUs. Our strategy is to produce multiple schedules, each one reflecting the delivery dates for inventory available at the same set of fulfillment centers and with similar characteristics such as weight and dimensions. Each SKU is mapped to a schedule, and SKUs from different shops can share the same schedule.

        Example mapping of SKUs to schedules of delivery dates

        In this example, SKU-1.2 from Shop 1 and SKU-3.1 from Shop 3 share the same schedule of delivery dates for heavy items stocked in both California and New York. If an order is placed today by 4pm EST for SKU-1.2 or SKU-3.1, shipping to zip code 10019 in New York, it will arrive on June 10. Likewise, if an order is placed today for SKU-1.2 or SKU-3.1 by 3pm PST, shipping to zip code 90002 in California, it will arrive on June 11.
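        The mapping in this example can be sketched as plain data structures. The schedule name, the year, and the shape of the data here are hypothetical; the real platform stores pre-computed schedules uploaded by the 3PL through the delivery date APIs.

```ruby
require "date"

# One shared schedule for heavy items stocked in both California and New York.
SCHEDULES = {
  "heavy-CA-NY" => { "10019" => Date.new(2022, 6, 10),   # order by 4pm EST
                     "90002" => Date.new(2022, 6, 11) }, # order by 3pm PST
}.freeze

# SKUs from different shops can point at the same schedule.
SKU_TO_SCHEDULE = { "SKU-1.2" => "heavy-CA-NY", "SKU-3.1" => "heavy-CA-NY" }.freeze

# A delivery-date lookup is just two hash reads: SKU -> schedule -> zip code.
def delivery_date_for(sku, zip)
  SCHEDULES.fetch(SKU_TO_SCHEDULE.fetch(sku))[zip]
end
```

Because SKU-1.2 and SKU-3.1 share one schedule, the dates are stored once, and the lookup stays a cheap constant-time read no matter how many shops share it.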

        Looking Up Delivery Dates in Real Time

        When Shopify surfaces a delivery date on a product details page or at checkout, it’s a direct, fast lookup (under 100 ms) to find the pre-computed date for that item to the destination address. This is because the SFN app uploads pre-calculated schedules of delivery dates to the Shopify platform and maps each SKU to a schedule using the delivery date APIs.  

        SFN system overview

        The date the buyer sees at checkout is sent back to SFN during order fulfillment, where it’s used to ensure that the order is routed to a warehouse, and shipped with a carrier service, that can meet the delivery date.

        There you have it, a highly simplified overview of how we built two-day delivery in the SFN.

        Learn More About SFN

        Spin Cycle: Shopify's SFN Team Overcomes a Cloud-Development Spiral

        Wherever you are, your next journey starts here! If building systems from the ground up to solve real-world problems interests you, our Engineering blog has stories about other challenges we have encountered. Intrigued? Visit our Engineering career page to find out about our open positions and learn about Digital by Design.

        Instant Performance Upgrade: From FlatList to FlashList

        When was the last time you scrolled through a list? Odds are it was today, maybe even seconds ago. Iterating over a list of items is a feature common to many frameworks, and React Native’s FlatList is no different. The FlatList component renders items in a scrollable view without you having to worry about memory consumption and layout management (sort of, we’ll explain later).

        The challenge, as many developers can attest to, is getting FlatList to perform on a range of devices without display artifacts like drops in UI frames per second (FPS) and blank items while scrolling fast.

        Our React Native Foundations team solved this challenge by creating FlashList, a fast and performant list component that can be swapped into your project with minimal effort. The requirements for FlashList included

        • High frame rate: even when using low-end devices, we want to guarantee the user can scroll smoothly, at a consistent 60 FPS or greater.
        • No more blank cells: minimize the display of empty items caused by code not being able to render them fast enough as the user scrolls (and causing them to wonder what’s up with their app).
        • Developer friendly: we wanted FlashList to be a drop-in replacement for FlatList and also include helpful features beyond what the current alternatives offer.

        The FlashList API is five stars. I've been recommending all my friends try FlashList once we open source.

        Daniel Friyia, Shopify Retail team
        Performance comparison between FlatList and FlashList

        Here’s how we approached the FlashList project and its benefits to you.

        The Hit List: Why the World Needs Faster Lists

        Evolving FlatList was the perfect match between our mission to continuously create powerful components shared across Shopify and solving a difficult technical challenge for React Native developers everywhere. As more apps move from native to React Native, how could we deliver performance that met the needs of today’s data-hungry users while keeping lists simple and easy to use for developers?

        Lists are in heavy use across Shopify, in everything from Shop to POS. For Shopify Mobile in particular, where over 90% of the app uses native lists, there was growing demand for a more performant alternative to FlatList as we moved more of our work to React Native. 

        What About RecyclerListView?

        RecyclerListView is a third-party package optimized for rendering large amounts of data with very good real-world performance. The difficulty lies in using it: developers must spend a lot of time understanding how it works and experimenting with it, and it requires a significant amount of code to manage. For example, RecyclerListView needs predicted size estimates for each item, and if they aren’t accurate, the UI performance suffers. It also renders the entire layout in JavaScript, which can lead to visible glitches.

        When done right, RecyclerListView works very well! But it’s just too difficult to use most of the time. It’s even harder if you have items with dynamic heights that can’t be measured in advance.

        So, we decided to build upon RecyclerListView and solve these problems to give the world a new list library that achieved native-like performance and was easy to use.

        Shop app's search page using FlashList

        The Bucket List: Our Approach to Improving React Native Lists

        Our React Native Foundations team takes a structured approach to solving specific technical challenges and creates components for sharing across Shopify apps. Once we’ve identified an ambitious problem, we develop a practical implementation plan that includes rigorous development and testing, and an assessment of whether something should be open sourced or not. 

        Getting list performance right is a popular topic in the React Native community, and we’d heard about performance degradations when porting apps from native to React Native within Shopify. This was the perfect opportunity for us to create something better. We kicked off the FlashList project to build a better list component that had a similar API to FlatList and boosted performance for all developers. We also heard some skepticism about its value, as some developers rarely saw issues with their lists on newer iOS devices.

        The answer here was simple. Our apps are used on a wide range of devices, so developing a performant solution across devices based on a component that was already there made perfect sense.

        “We went with a metrics-based approach for FlashList rather than subjective, perception-based feelings. This meant measuring and improving hard performance numbers like blank cells and FPS to ensure the component worked on low-end devices first, with high-end devices the natural byproduct.” - David Cortés, React Native Foundations team

        FlashList feedback via Twitter

        Improved Performance and Memory Utilization

        Our goal was to beat the performance of FlatList by orders of magnitude, measured by UI thread FPS and JavaScript FPS. With FlatList, developers typically see frame drops, even with simple lists. With FlashList, we improved upon the FlatList approach of generating new views from scratch on every update by using an existing, already allocated view and refreshing elements within it (that is, recycling cells). We also moved some of the layout operations to native, helping smooth over some of the UI glitches seen when RecyclerListView is provided with incorrect item sizes.

        This streamlined approach boosted performance to 60 FPS or greater—even on low-end devices!

        Comparison between FlatList and FlashList via Twitter (note this is on a very low-end device)

        We applied a similar strategy to improve memory utilization. Say you have a Twitter feed with 200 to 300 tweets per page: FlatList starts rendering with a large number of items to ensure they’re available as the user scrolls up or down. FlashList, in comparison, requires a much smaller buffer, which reduces memory footprint, improves load times, and keeps the app significantly more responsive. The default buffer is just 250 pixels.

        Both these techniques, along with other optimizations, help FlashList achieve its goal of no more blank cells, as the improved render times can keep up with user scrolling on a broader range of devices. 

        Shopify Mobile is using FlashList as the default and the Shop team moved search to FlashList. Multiple teams have seen major improvements and are confident in moving the rest of their screens.

        Talha Naqvi, Software Development Manager, Retail Accelerate

        These performance improvements included extensive testing and benchmarking on Android devices to ensure we met the needs of a range of capabilities. A developer may not see blank items or choppy lists on the latest iPhone but that doesn’t mean it’ll work the same on a lower-end device. Ensuring FlashList was tested and working correctly on more cost-effective devices made sure that it would work on the higher-end ones too.

        Developer Friendly

        If you know how FlatList works, you know how FlashList works. It’s easy to learn, as FlashList uses almost the same API as FlatList, and has new features designed to eliminate any worries about layout and performance, so you can focus on your app’s value.

        Shotgun's main feature is its feed, so ensuring consistent and high-quality performance has always been crucial. That's why using FlashList was a no brainer. I love that compared to the classic FlatList, you can scrollToIndex to an index that is not within your view. This allowed us to quickly develop our new event calendar list, where users can jump between dates to see when and where there are events.

        It takes seconds to swap your existing list implementation from FlatList to FlashList. All you need to do is change the component name and optionally add the estimatedItemSize prop, as you can see in this example from our documentation:

        Example of FlashList usage.

        Powerful Features

        Going beyond the standard FlatList props, we added new features based on common scenarios and developer feedback. Here are three examples:

        • getItemType: improves render performance for lists that have different types of items, like text vs. image, by leveraging separate recycling pools based on type.
        Example of getItemType usage
        • Flexible column spans: developers can create grid layouts where each item’s column span can be set explicitly or with a size estimate (using overrideItemLayout), providing flexibility in how the list appears.
        • Performance metrics: as we moved many layout functions to native, FlashList can extract key metrics for developers to measure and troubleshoot performance, like reporting on blank items and FPS. This guide provides more details.

        Documentation and Samples

        We provide a number of resources, including documentation and sample code, to help you get started with FlashList.

        Flash Forward with FlashList

        Accelerate your React Native list performance by installing FlashList now. It’s already deployed within Shopify Mobile, Shop, and POS and we’re working on known issues and improvements. We recommend starring the FlashList repo and can’t wait to hear your feedback!

        Want to read more about the process of launching this open source project? Check out our follow-up post, What We Learned from Open-Sourcing FlashList.


        When is JIT Faster Than A Compiler?

        I had this conversation over and over before I really understood it. It goes:

        “X language can be as fast as a compiled language because it has a JIT compiler!”
        “Wow! It’s as fast as C?”
        “Well, no, we’re working on it. But in theory, it could be even FASTER than C!”
        “Really? How would that work?”
        “A JIT can see what your code does at runtime and optimize for only that. So it can do things C can’t!”
        “Huh. Uh, okay.”

        It gets hand-wavy at the end, doesn’t it? I find that frustrating. These days I work on YJIT, a JIT for Ruby. So I can make this extremely NOT hand-wavy. Let’s talk specifics.

        I like specifics.

        Wait, What’s JIT Again?

        An interpreter reads a human-written description of your app and executes it. You’d usually use interpreters for Ruby, Python, Node.js, SQL, and nearly all high-level dynamic languages. When you run an interpreted app, you download the human-written source code, and you have an interpreter on your computer that runs it. The interpreter effectively sees the app code for the first time when it runs it. So an interpreter doesn’t usually spend much time fixing or improving your code. It just runs the code as written. An interpreter that significantly transforms your code or generates machine code tends to be called a compiler.

        A compiler typically turns that human-written code into native machine code, like those big native apps you download. The most straightforward compilers are ahead-of-time compilers. They turn human-written source code into a native binary executable, which you can download and run. A good compiler can greatly speed up your code by putting a lot of effort into improving it ahead of time. This is beneficial for users because the app developer runs the compiler for them. The app developer pays the compile cost, and users get a fast app. Sometimes people call anything a compiler if it translates from one kind of code to another—not just source code to native machine code. But when I say “compiler” here, I mean the source-code-to-machine-code kind.

        A JIT, aka a Just-In-Time compiler, is a partial compiler. A JIT waits until you run the program and then translates the most-used parts of your program into fast native machine code. This happens every time you run your program. It doesn’t write the code to a file—okay, except MJIT and a few others. But JIT compilation is primarily a way to speed up an interpreter—you keep the source code on your computer, and the interpreter has a JIT built into it. And then long-running programs go faster.

        It sounds kind of inefficient, doesn’t it? Doing it all ahead of time sounds better to me than doing it every time you run your program.

        But some languages are really hard to compile correctly ahead of time. Ruby is one of them. And even when you can compile ahead of time, often you get bad results. An ahead-of-time compiler has to create native code that will always be correct, no matter what your program does later, and sometimes that means it’s about as bad as an interpreter, which has that exact same requirement.

        Ruby is Unreasonably Dynamic

        Ruby is like my four-year-old daughter: the things I love most about it are what make it difficult.

        In Ruby, I can redefine + on integers like 3, 7, or -45. Not just at the start—if I wanted, I could write a loop and redefine what + means every time through that loop. My new + could do anything I want. Always return an even number? Print a cheerful message? Write “I added two numbers” to a log file? Sure, no problem.
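        For instance, the “always return an even number” version is only a few lines. This is a contrived demonstration of how far Ruby lets you go, not something you should ship:

```ruby
class Integer
  # Keep a handle on the built-in addition before replacing it.
  alias_method :original_plus, :+

  # Redefine + so every sum is rounded up to an even number.
  def +(other)
    sum = original_plus(other)
    sum.even? ? sum : sum.original_plus(1)
  end
end
```

After this runs, 3 + 4 returns 8, everywhere, and every piece of Ruby code that adds integers has to pick up the new behavior.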

        That’s thrilling and wonderful and awful in roughly equal proportions.

        And it’s not just +. It’s every operator on every type. And equality. And iteration. And hashing. And so much more. Ruby lets you redefine it all.

        The Ruby interpreter needs to stop and check, every time you add two numbers, whether you have changed what + means in between. You can even redefine + in a background thread, and Ruby just rolls with it. It picks up the new + and keeps right on going. In a world where everything can be redefined, the interpreter can take very little for granted.

        Ruby lets you do awful, awful things. It lets you do wonderful, wonderful things. Usually, it’s not obvious which is which. You have expressive power that most languages say is a very bad idea.

        I love it.

        Compilers do not love it.

        When JITs Cheat, Users Win

        Okay, we’ve talked about why it’s hard for ahead-of-time (AOT) compilers to deliver performance gains. But then, how do JIT compilers do it? Ruby lets you constantly change absolutely everything. That’s not magically easy at runtime. If you can’t compile + or == or any operator, why can you compile some parts of the program?

        With a JIT, you have a compiler around as the program runs. That allows you to do a trick.

        The trick: you can compile the method wrong and still get away with it.

        Here’s what I mean.

        YJIT asks, “Well, what if you didn’t change what + means every time?” You almost never do that. So it can compile a version of your method where + keeps its meaning from right now. So do equality, iteration, hashing, and everything else you can change in Ruby but nearly never do.

        But… that’s wrong. What if I do change those things? Sometimes apps do. I’m looking at you, ActiveRecord.

        But your JIT has a compiler around at runtime. So when you change what + means, it will throw away all those methods it compiled with the old definition. Poof. Gone. If you call them again, you get the interpreted version again. For a while—until JIT compiles a new version with the new definition. This is called de-optimization. When the code starts being wrong, throw it away. When 3+4 stops being 7 (hey, this is Ruby!), get rid of the code that assumed it was. The devil is in the details—switching from one version of a method to another version midway through is not easy. But it’s possible, and JIT compilers basically do it successfully.

        So your JIT can assume you don’t change + every time through the loop. Compilers and interpreters can’t get away with that.
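        Here’s a toy model of that guard-and-invalidate dance in plain Ruby. It only simulates the idea; YJIT’s real implementation generates machine code and uses CRuby’s redefinition flags rather than a version counter.

```ruby
# Toy model of JIT de-optimization: "compiled" code is valid only while the
# assumption it was built on (that Integer#+ hasn't changed) still holds.
class ToyJIT
  def initialize
    @plus_version = 0 # bumped whenever + is "redefined"
    @compiled = {}    # method name => [assumed_version, fast_path]
  end

  def redefine_plus!
    @plus_version += 1
    @compiled.clear   # de-optimization: throw the stale code away
  end

  def compiled?(name)
    @compiled.key?(name)
  end

  def run_add(a, b)
    version, fast_path = @compiled[:add]
    if version == @plus_version
      fast_path.call(a, b)              # fast path: assumptions still hold
    else
      fast_path = ->(x, y) { x + y }    # "recompile" against the current +
      @compiled[:add] = [@plus_version, fast_path]
      fast_path.call(a, b)
    end
  end
end
```

Calling run_add compiles and caches the fast path; redefine_plus! discards it, and the next call falls back and recompiles, just like the interpreter fallback described above.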

        An AOT compiler has to create fully correct code before your app even ships. It’s very limited if you change anything. And even if it had some kind of fallback (“Okay, I see three things 3+4 could be in this app”), it can only respond at runtime with something it figured out ahead of time. Usually, that means very conservative code that constantly checks if you changed anything.

        An interpreter must be fully correct and respond immediately if you change anything. So it normally assumes that you could have redefined everything at any time. The normal Ruby interpreter spends a lot of time checking whether you changed the definition of + over time. You can do clever things to speed up that check, and CRuby does. But if you make your interpreter extremely clever, pre-building optimized code and invalidating assumptions, eventually you realize that you’ve built a JIT.

        Ruby and YJIT

        I work on YJIT, which is part of CRuby. We do the stuff I mention here. It’s pretty fast.

        There are a lot of fun specifics to figure out. What do we track? How do we make it faster? When it’s invalid, do we need to recompile or cancel it? Here’s an example I wrote recently.

        You can try out our work by turning on --yjit on recent Ruby versions. You can use even more of our work if you build the latest head-of-master Ruby, perhaps with ruby-build 3.2.0-dev. You can also get all the details by reading the source, which is built right into CRuby.

        By the way, YJIT has some known bugs in 3.1 that mean you should NOT use it for real production. We’re a lot closer now—it should be production-ready for some uses in 3.2, which comes out Christmas 2022.

        What Was All That Again?

        A JIT can add assumptions to your code, like the fact that you probably didn’t change what + means. Those assumptions make the compiled code faster. If you do change what + means, you can throw away the now-incorrect code.

        An ahead-of-time compiler can’t do that. It has to assume you could change anything you want. And you can.

        An interpreter can’t do that. It has to assume you could have changed anything at any time. So it re-checks constantly. A sufficiently smart interpreter that pre-builds machine code for current assumptions and invalidates if it changes could be as fast as JIT… Because it would be a JIT.

        And if you like blog posts about compiler internals—who doesn’t?—you should hit “Yes, sign me up” up above and to the left.

        Noah Gibbs wrote the ebook Rebuilding Rails and then a lot about how fast Ruby is at various tasks. Despite being a grumpy old programmer in Inverness, Scotland, Noah believes that some day, somehow, there will be a second game as good as Stuart Smith’s Adventure Construction Set for the Apple IIe. Follow Noah on Twitter and GitHub


        Spin Cycle: Shopify’s SFN Team Overcomes a Cloud-Development Spiral

        You may have read about Spin, Shopify’s new cloud-based development tool. Instead of editing and running a local version of a service on a developer’s MacBook, Shopify is moving towards a world where the development servers are available on-demand as a container running in Kubernetes. When using Spin, you don’t need anything on your local machine other than an ssh client and VSCode, if that’s your editor of choice. 

        By moving development off our MacBooks and onto Spin, we unlock the ability to easily share work in progress with coworkers and can work on changes that span different codebases without any friction. And because Spin instances are lightweight and ephemeral, we don’t run the risk of messing up long-lived development databases when experimenting with data migrations.

        Across Shopify, teams have been preparing and adjusting their codebases so that their services can run smoothly in this kind of environment. In the Shopify Fulfillment Network (SFN) engineering org, we put together a team of three engineers to get us up and running on Spin.

        At first, it seemed like the job would be relatively straightforward. But as we started doing the work, we began to notice some less obvious forces at play that were pushing against our efforts.

        Since it was easier for most developers to use our old tooling instead of Spin while we were getting the kinks worked out, developers would often unknowingly commit changes that broke some functionality we’d just enabled for Spin. In hindsight, the process of getting SFN working on Spin is a great example of the kind of hidden difficulty in technical work that's more related to human systems than how to get bits of electricity to do what you want.

        Before we get to the interesting stuff, it’s important to understand the basics of the technical challenge. We'll start by getting a broad sense of the SFN codebase and then go into the predictable work that was needed to get it running smoothly in Spin. With that foundation, we’ll be able to describe how and why we started treading water, and ultimately how we’re pushing past that.

        The Shape of SFN

        SFN exists to take care of order fulfillment on behalf of Shopify merchants. After a customer has completed the checkout process, their order information is sent to SFN. We then determine which warehouse has enough inventory and is best positioned to handle the order. Once SFN has identified the right warehouse, it sends the order information to the service responsible for managing that warehouse’s operations. The state of the system is visible to the merchant through the SFN app running in the merchant’s Shopify admin. The SFN app communicates to Shopify Core via the same GraphQL queries and mutations that Shopify makes available to all app developers.

        At a highly simplified level, this is the general shape of the SFN codebase:

        Diagram of SFN codebase. SFN is a large rectangle in the center containing six boxes labelled Subcomponent. Outside of the SFN box are six directional arrows. Each arrow connects to boxes called Dependency
        SFN’s monolithic Rails application with many external dependencies

        Similar to the Shopify codebase, SFN is a monolithic Rails application divided into individual components owned by particular teams. Unlike Shopify Core, however, SFN has many strong dependencies on services outside of its own monolith.

        SFN’s biggest dependency is on Shopify itself, but there are plenty more. For example, SFN does not design shipping labels, but does need to send shipping labels to the warehouse. So, SFN is a client to a service that provides valid shipping labels. Similarly, SFN does not tell the mobile Chuck robots in a warehouse where to go; we are a client of a service that handles warehouse operations.

        The value that SFN provides is in gluing together a bunch of separate systems with some key logic living in that glue. There isn't much you can do with SFN without those dependencies around in some form.

        How SFN Handles Dependencies

        As software developers, we need quick feedback about whether an in-progress change is working as expected. And to know if something is working in SFN, that code generally needs to be validated alongside one or several of SFN’s dependencies. For example, if a developer is implementing a feature to display some text in the SFN app after a customer has placed an order, there’s no useful way to validate that change without also having Shopify available.

        So the work of getting a useful development environment for SFN with Spin appears to be about looking at each dependency, figuring how to handle that dependency, and then implementing that decision. We have a few options for how to handle any particular dependency when running SFN in Spin:

        1. Run an instance of the dependency directly in the same Spin container.
        2. Mock the dependency.
        3. Use a shared running instance of the dependency, such as a staging or live test environment.
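        As a sketch, the decision for each dependency might be captured in a table like the one below. The names, strategies, and URL scheme are hypothetical; Spin’s actual configuration mechanism differs.

```ruby
# Hypothetical per-dependency strategy table for an SFN-like Spin setup.
# :local  => run a real instance in the same Spin container
# :mock   => stub the service out
# :shared => point at a shared staging / live test instance
DEPENDENCY_STRATEGIES = {
  shopify_core:      :local,   # SFN is of little use without Shopify itself
  label_service:     :mock,    # shipping labels can be faked in development
  warehouse_service: :shared,  # use a shared staging environment
}.freeze

def endpoint_for(dependency)
  case DEPENDENCY_STRATEGIES.fetch(dependency)
  when :local  then "http://#{dependency}.localhost"
  when :mock   then "http://mock-#{dependency}.localhost"
  when :shared then "https://#{dependency}.staging.example.com"
  end
end
```

Making the choice explicit per dependency is what turns “get SFN running on Spin” into a finite checklist, even if, as the next sections show, the checklist doesn’t stay finished.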

        Given all the dependencies that SFN has, this seems like a decent amount of work for a three-person team.

        But this is not the full extent of the problem—it’s just the foundation.

        Once we added configuration to make some dependency or some functional flow of SFN work in Spin, another commit would often be added to SFN that nullifies that effort. For example, after getting some particular flow functioning in a Spin environment, the implementation of that flow might be rewritten with new dependencies that are not yet configured to work in Spin.

        One apparent solution to this problem would be simply to pay more attention to what work is in flight in the SFN codebase and better prepare for upcoming changes.

        But here’s the problem: It’s not just one or two flows changing. Across SFN, the internal implementation of functionality is constantly being improved and refactored. With over 150 SFN engineers deploying to production over 30 times a day, the SFN codebase doesn’t sit still for long. On top of that, Spin itself is constantly changing. And all of SFN’s dependencies are changing. For any dependencies that were mocked, those mocks will become stale and need to be updated.

        The more we accomplished, the more functionality existed with the potential to stop working when something changes. And when one of those regressions occurred, we needed to interrupt the dependency we were working on solving in order to keep a previously solved flow functioning. The tension between making improvements and maintaining what you’ve already built is central to much of software engineering. Getting SFN working on Spin was just a particularly good example.

        The Human Forces on the System

        After recognizing the problem, we needed to step back and look at the forces acting on the system. What incentive structures and feedback loops are contributing to the situation?

        In the case of getting SFN working on Spin, changes were happening frequently and those changes were causing regressions. Some of those changes were within our control (e.g., a change goes into SFN that isn’t configured to work in Spin), and some are less so (e.g., Spin itself changing how it uses certain inputs).

        This led us to observe two powerful feedback loops that could be happening when SFN developers are working in Spin:

        Two feedback loops: Loop of Happy Equilibrium and Spiral of Struggle.
        Loop of Happy Equilibrium and Spiral of Struggle

        If it’s painful to use Spin for SFN development, it’s less likely that developers will use Spin the next time they have to validate their work. And if a change hasn’t been developed and tested using Spin, maybe something about that change breaks a particular testing flow, and that causes another SFN developer to become frustrated enough to stop using Spin. And this cycle continues until SFN is no longer usable in Spin.

        Alternatively, if it’s a great experience to use and validate work in Spin, developers will likely want to continue using the tool, which will catch any Spin-specific regressions before they make it into the main branch.

        As you can imagine, it’s very difficult to move from the Spiral of Struggle into the positive Loop of Happy Equilibrium. Our solution is to try our best to dampen the force acting on the negative spiral while simultaneously propping up the force of the positive feedback loop. 

        As the team focused on getting SFN working on Spin, our solution was to be very intentional about where we spent our efforts, while asking the other SFN developers to endure a little pain and pitch in as we went through the transition. The SFN-on-Spin team narrowed its focus to getting SFN to a basic level of functionality on Spin so that most developers could use it for the most common validation flows, and we prioritized fixing any bugs that disrupted those areas. This meant explicitly not working to get all SFN functionality running on Spin, but just enough that we could manage upkeep. At the same time, we asked other SFN developers to use Spin for their daily work, even though it was missing functionality they needed or wanted. Where they felt frustrations or saw gaps, we encouraged and supported them in adding the functionality they needed.

        Breaking the cycle

        Our hypothesis is that this is a temporary stage in the transition to cloud development. If we’re successful, we’ll land in the Loop of Happy Equilibrium, where regressions are caught before they’re merged, individuals add the missing functionality they need, and everyone ultimately has a fun time developing and feels confident shipping their code.

        Our job seems to be all about code and making computers do what we say. But many of the real-life challenges we face when working on a codebase are not apparent from code or architecture diagrams. Instead they require us to reflect on the forces operating on the humans that are building that software. And once we have an idea of what those forces might be, we can brainstorm how to disrupt or encourage the feedback loops we’ve observed.

        Jen is a Staff Software Engineer at Shopify who's spent her career seeking out and building teams that challenge the status quo. In her free time, she loves getting outdoors and spending time with her chocolate lab.

        Learn More About Spin

        The Journey to Cloud Development: How Shopify Went All-In On Spin
        The Story Behind Shopify's Isospin Tooling
        Spin Infrastructure Adventures: Containers, Systemd, and CGroups

        Wherever you are, your next journey starts here! If building systems from the ground up to solve real-world problems interests you, our Engineering blog has stories about other challenges we have encountered. Intrigued? Visit our Engineering career page to find out about our open positions and learn about Digital by Design.

        Mastering React’s Stable Values

        The concept of a stable value is a distinctly React term, and especially relevant since the introduction of Functional Components. It refers to values (usually coming from a hook) that keep the same value across multiple renders. And they’re immediately confusing. In this post, Colin Gray, Principal Developer at Shopify, walks through some cases where they really matter and how to make sense of them.

        Data-Centric Machine Learning: Building Shopify Inbox’s Message Classification Model

        By Eric Fung and Diego Castañeda

        Shopify Inbox is a single business chat app that manages all Shopify merchants’ customer communications in one place, and turns chats into conversions. As we were building the product it was essential for us to understand how our merchants’ customers were using chat applications. Were they reaching out looking for product recommendations? Wondering if an item would ship to their destination? Or were they just saying hello? With this information we could help merchants prioritize responses that would convert into sales and guide our product team on what functionality to build next. However, with millions of unique messages exchanged in Shopify Inbox per month, this was going to be a challenging natural language processing (NLP) task. 

        Our team didn’t need to start from scratch, though: off-the-shelf NLP models are widely available to everyone. With this in mind, we decided to apply a newly popular machine learning process—the data-centric approach. We wanted to focus on fine-tuning these pre-trained models on our own data to yield the highest model accuracy, and deliver the best experience for our merchants.

        A merchant’s Shopify Inbox screen titled Customers that displays snippets of messages from customers that are labelled with things for easy identification like product details, checkout, and edit order.
        Message Classification in Shopify Inbox

        We’ll share our journey of building a message classification model for Shopify Inbox by applying the data-centric approach. From defining our classification taxonomy to carefully training our annotators on labeling, we dive into how a data-centric approach, coupled with a state-of-the-art pre-trained model, led to a very accurate prediction service we’re now running in production.

        Why a Data-Centric Approach?

        A traditional development model for machine learning begins with obtaining training data, then successively trying different model architectures to overcome any poor data points. This model-centric process is typically followed by researchers looking to advance the state-of-the-art, or by those who don't have the resources to clean a crowd-sourced dataset.

        By contrast, a data-centric approach focuses on iteratively making the training data better to reduce inconsistencies, thereby yielding better results for a range of models. Since anyone can download a well-performing, pre-trained model, getting a quality dataset is the key differentiator in being able to produce a high-quality system. At Shopify, we believe that better training data yields machine learning models that can serve our merchants better. If you’re interested in hearing more about the benefits of the data-centric approach, check out Andrew Ng’s talk on MLOps: From Model-centric to Data-centric.

        Our First Prototype

        Our first step was to build an internal prototype that we could ship quickly. Why? We wanted to build something that would enable us to understand what buyers were saying. It didn’t have to be perfect or complex; it just had to prove that we could deliver something with impact. We could iterate afterwards.

        For our first prototype, we didn't want to spend a lot of time on the exploration, so we had to construct both the model and training data with limited resources. Our team chose a pre-trained model available on TensorFlow Hub called Universal Sentence Encoder. This model can output embeddings for whole sentences while taking into account the order of words. This is crucial for understanding meaning. For example, the two messages below use the same set of words, but they have very different sentiments:

        • “Love! More please. Don’t stop baking these cookies.”
        • “Please stop baking more cookies! Don’t love these.”

        To rapidly build our training dataset, we sought to identify groups of messages with similar meaning, using various dimensionality reduction and clustering techniques, including UMAP and HDBScan. After manually assigning topics to around 20 message clusters, we applied a semi-supervised technique. This approach takes a small amount of labeled data, combined with a larger amount of unlabeled data. We hand-labeled a few representative seed messages from each topic, and used them to find additional examples that were similar. For instance, given a seed message of “Can you help me order?”, we used the embeddings to help us find similar messages such as “How to order?” and “How can I get my orders?”. We then sampled from these to iteratively build the training data.
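
        The seed-expansion step described above can be sketched with cosine similarity over sentence embeddings. This is an illustrative sketch, not the team’s actual code: the random vectors below stand in for Universal Sentence Encoder outputs, and the threshold is an assumption.

```python
import numpy as np

def expand_seed_labels(seed_vecs, unlabeled_vecs, threshold=0.8):
    """Return indices of unlabeled messages whose embedding is close
    (cosine similarity >= threshold) to any labeled seed message."""
    # Normalize rows so a dot product equals cosine similarity.
    seeds = seed_vecs / np.linalg.norm(seed_vecs, axis=1, keepdims=True)
    pool = unlabeled_vecs / np.linalg.norm(unlabeled_vecs, axis=1, keepdims=True)
    sims = pool @ seeds.T                  # shape: (n_unlabeled, n_seeds)
    best = sims.max(axis=1)                # best seed match per unlabeled message
    return np.where(best >= threshold)[0]

# Toy example: 3 "seed" embeddings, 4 "unlabeled" embeddings.
rng = np.random.default_rng(0)
seeds = rng.normal(size=(3, 8))
pool = np.vstack([seeds[0] + 0.01 * rng.normal(size=8),   # near-duplicate of a seed
                  rng.normal(size=(3, 8))])                # unrelated messages
print(expand_seed_labels(seeds, pool, threshold=0.95))
```

        Messages surfaced this way would then be sampled and reviewed, rather than trusted blindly, which is consistent with the caveat below that these labels weren’t always the ground truth.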

        A visualization of message clusters during one of our explorations.

        We used this dataset to train a simple predictive model containing an embedding layer, followed by two fully connected, dense layers. Our last layer contained the logits array for the number of classes to predict on.
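
        As a rough sketch of that architecture (illustrative sizes and untrained weights, not the production model): a 512-dimensional Universal Sentence Encoder embedding feeds two fully connected layers, with the final layer producing one logit per topic.

```python
import numpy as np

rng = np.random.default_rng(42)
EMB_DIM, HIDDEN, N_CLASSES = 512, 128, 20   # USE embeddings are 512-d; other sizes are assumptions

# Randomly initialized weights stand in for trained parameters.
W1 = rng.normal(scale=0.05, size=(EMB_DIM, HIDDEN)); b1 = np.zeros(HIDDEN)
W2 = rng.normal(scale=0.05, size=(HIDDEN, N_CLASSES)); b2 = np.zeros(N_CLASSES)

def classify(sentence_embedding):
    """Forward pass: embedding -> dense (ReLU) -> dense (logits) -> softmax."""
    hidden = np.maximum(0.0, sentence_embedding @ W1 + b1)  # first fully connected layer
    logits = hidden @ W2 + b2                               # one logit per topic
    probs = np.exp(logits - logits.max())                   # numerically stable softmax
    return probs / probs.sum()

probs = classify(rng.normal(size=EMB_DIM))
```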

        This model gave us some interesting insights. For example, we observed that a lot of chat messages are about the status of an order. This helped inform our decision to build an order status request as part of Shopify Inbox’s Instant Answers FAQ feature. However, our internal prototype had a lot of room for improvement. Overall, our model achieved a 70 percent accuracy rate and could only classify 35 percent of all messages with high confidence (what we call coverage). While our scrappy approach of using embeddings to label messages was fast, the labels weren’t always the ground truth for each message. Clearly, we had some work to do.

        We know that our merchants have busy lives and want to respond quickly to buyer messages, so we needed to increase the accuracy, coverage, and speed for version 2.0. Wanting to follow a data-centric approach, we focused on how we could improve our data to improve our performance. We made the decision to put additional effort into defining the training data by re-visiting the message labels, while also getting help to manually annotate more messages. We sought to do all of this in a more systematic way.

        Creating A New Taxonomy

        First, we dug deeper into the topics and message clusters used to train our prototype. We found several broad topics containing hundreds of examples that conflated distinct semantic meanings. For example, messages asking about shipping availability to various destinations (pre-purchase) were grouped in the same topic as those asking about what the status of an order was (post-purchase).

        Other topics had very few examples, while a large number of messages didn’t belong to any specific topic at all. It’s no wonder that a model trained on such a highly unbalanced dataset wasn’t able to achieve high accuracy or coverage.

        We needed a new labeling system that would be accurate and useful for our merchants. It also had to be unambiguous and easy to understand by annotators, so that labels would be applied consistently. A win-win for everybody!

        This got us thinking: who could help us with the taxonomy definition and the annotation process? Fortunately, we have a talented group of colleagues. We worked with our staff content designer and product researcher who have domain expertise in Shopify Inbox. We were also able to secure part-time help from a group of support advisors who deeply understand Shopify and our merchants (and by extension, their buyers).

        Over a period of two months, we got to work sifting through hundreds of messages and came up with a new taxonomy. We listed each new topic in a spreadsheet, along with a detailed description, cross-references, disambiguations, and sample messages. This document would serve as the source of truth for everyone in the project (data scientists, software engineers, and annotators).

        In parallel with the taxonomy work, we also looked at the latest pre-trained NLP models, with the aim of fine-tuning one of them for our needs. The Transformer family is one of the most popular, and we were already using that architecture in our product categorization model. We settled on DistilBERT, a model that promised a good balance between performance, resource usage, and accuracy. Some prototyping on a small dataset built from our nascent taxonomy was very promising: the model was already performing better than version 1.0, so we decided to double down on obtaining a high-quality, labeled dataset.

        Our final taxonomy contained more than 40 topics, grouped under five categories: 

        • Products
        • Pre-Purchase
        • Post-Purchase
        • Store
        • Miscellaneous

        We arrived at this hierarchy by thinking about how an annotator might approach classifying a message, viewed through the lens of a buyer. The first thing to determine is: where was the buyer on their purchase journey when the message was sent? Were they asking about a detail of the product, like its color or size? Or, was the buyer inquiring about payment methods? Or, maybe the product was broken, and they wanted a refund? Identifying the category helped to narrow down our topic list during the annotation process.

        Our in-house annotation tool displaying the message to classify, along with some of the possible topics, grouped by category

        Each category contains an “other” topic to group the messages that don’t have enough content to be clearly associated with a specific topic. We decided not to train the model with the examples classified as “other” because, by definition, they were messages we couldn’t classify ourselves in the proposed taxonomy. In production, these messages get classified by the model with low probabilities. By setting a probability threshold on every topic in the taxonomy, we could decide later whether to ignore them or not.
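
        That thresholding step can be sketched as follows; the topic names and threshold values here are hypothetical, not the production configuration.

```python
# Hypothetical per-topic confidence thresholds; below them a prediction is ignored.
THRESHOLDS = {"order_status": 0.70, "shipping_availability": 0.75, "product_details": 0.65}

def route_prediction(topic, probability, thresholds=THRESHOLDS):
    """Return the topic if the model is confident enough, else None (treat as 'other')."""
    return topic if probability >= thresholds.get(topic, 1.0) else None

print(route_prediction("order_status", 0.91))   # confident -> "order_status"
print(route_prediction("order_status", 0.40))   # low confidence -> None
```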

        Since this taxonomy was pretty large, we wanted to make sure that everyone interpreted it consistently. We held several training sessions with our annotation team to describe our classification project and philosophy. We divided the annotators into two groups so they could annotate the same set of messages using our taxonomy. This exercise had a two-fold benefit:

        1. It gave annotators first-hand experience using our in-house annotation tool.
        2. It allowed us to measure inter-annotator agreement.

        This process was time-consuming as we needed to do several rounds of exercises. But, the training led us to refine the taxonomy itself by eliminating inconsistencies, clarifying descriptions, adding additional examples, and adding or removing topics. It also gave us reassurance that the annotators were aligned on the task of classifying messages.
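
        Inter-annotator agreement between two groups labeling the same messages is commonly measured with Cohen’s kappa, which corrects raw agreement for chance. A minimal pure-Python version (the labels below are made up for illustration):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both annotators pick the same label independently.
    expected = sum(freq_a[k] * freq_b.get(k, 0) for k in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

a = ["order", "product", "product", "refund", "order"]
b = ["order", "product", "refund", "refund", "order"]
print(round(cohens_kappa(a, b), 3))   # -> 0.706
```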

        Let The Annotation Begin

        Once we and the annotators felt ready, the group began to annotate messages. We set up a Slack channel for everyone to collaborate and work through tricky messages as they arose. This allowed everyone to see the thought process used to arrive at a classification.

        During the preprocessing of training data, we discarded single-character messages and messages consisting of only emojis. During the annotation phase, we excluded other kinds of noise from our training data. Annotators also flagged content that wasn’t actually a message typed by a buyer, such as when a buyer cut-and-pastes the body of an email they’ve received from a Shopify store confirming their purchase. As the old saying goes, garbage in, garbage out. Lastly, due to our current scope and resource constraints, we had to set aside non-English messages.

        Handling Sensitive Information

        You might be wondering how we dealt with personal information (PI) like emails or phone numbers. PI occasionally shows up in buyer messages and we took steps to handle this information. We identified the messages with PI, then replaced such data with mock data.

        This process began with our annotators flagging messages containing PI. Next, we used an open-source library called Presidio to analyze and replace the flagged PI. This tool ran in our data warehouse, keeping our merchants’ data within Shopify’s systems. Presidio is able to recognize many different types of PI, and the anonymizer provides different kinds of operators that can transform the instances of PI into something else. For example, you could completely remove it, mask part of it, or replace it with something else.

        In our case, we used another open-source tool called Faker to replace the PI. This library is customizable and localized, and its providers can generate realistic addresses, names, locations, URLs, and more. Here’s an example of its Python API:

        Combining Presidio and Faker allowed us to automate parts of the PI replacement. Below is a fabricated example of running Presidio over a message.

        Before running Presidio:

        can i pickup today? i ordered this am: Sahar Singh my phone is 852 5555 1234. Email is

        After running Presidio:

        can i pickup today? i ordered this am: Sahar Singh my phone is 090-722-7549. Email is


        After running Presidio, we also inspected the before and after output to identify whether any PI was still present. Any remaining PI was manually replaced with a placeholder (for example, the name "Sahar Singh" in the fabricated example above would be replaced with <PERSON>). Finally, we ran another script to replace the placeholders with Faker-generated data.

        A Little Help From The Trends

        Towards the end of our annotation project, we noticed a trend that persisted throughout the campaign: some topics in our taxonomy were overrepresented in the training data. It turns out that buyers ask a lot of questions about products!

        Our annotators had already gone through thousands of messages. We couldn’t afford to split up the topics with the most popular messages and re-classify them, but how could we ensure our model performed well on the minority classes? We needed to get more training examples from the underrepresented topics.

        Since we were continuously training a model on the labeled messages as they became available, we decided to use it to help us find additional messages. Using the model’s predictions, we excluded any messages classified with the overrepresented topics. The remaining examples belonged to the other topics, or were ones that the model was uncertain about. These messages were then manually labeled by our annotators.
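
        A sketch of that filtering step (topic names and the confidence cutoff are hypothetical): run the interim model over unlabeled messages and keep only those not confidently assigned to an overrepresented topic.

```python
OVERREPRESENTED = {"product_details"}   # hypothetical overrepresented topic
CONFIDENCE = 0.8                        # hypothetical cutoff for "confidently classified"

def needs_annotation(predictions):
    """Keep messages the model doesn't confidently place in an overrepresented topic.

    `predictions` maps message -> (top predicted topic, probability)."""
    return [msg for msg, (topic, prob) in predictions.items()
            if topic not in OVERREPRESENTED or prob < CONFIDENCE]

preds = {
    "what colours does this come in?": ("product_details", 0.95),  # skip: already well covered
    "where is my package": ("order_status", 0.90),                 # keep: minority topic
    "do u have this in blue": ("product_details", 0.55),           # keep: model uncertain
}
print(needs_annotation(preds))   # -> ['where is my package', 'do u have this in blue']
```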


        So, after all of this effort to create a high-quality, consistently labeled dataset, what was the outcome? How did it compare to our first prototype? Not bad at all. We achieved our goal of higher accuracy and coverage:


                                      Version 1.0 Prototype           Version 2.0 in Production

        Size of training set
        Annotation strategy           Based on embedding similarity   Human labeled
        Taxonomy classes              ~20                             40+
        Model accuracy                70%
        High confidence coverage      35%

        Another key part of our success was working collaboratively with diverse subject matter experts. Bringing in our support advisors, staff content designer, and product researcher provided perspectives and expertise that we as data scientists couldn’t achieve alone.

        While we shipped something we’re proud of, our work isn’t done. This is a living project that will require continued development. As trends and sentiments change over time, the topics of conversations happening in Shopify Inbox will shift accordingly. We’ll need to keep our taxonomy and training data up-to-date and create new models to continue to keep our standards high.

        If you want to learn more about the data work behind Shopify Inbox, check out Building a Real-time Buyer Signal Data Pipeline for Shopify Inbox that details the real-time buyer signal data pipeline we built.

        Eric Fung is a senior data scientist on the Messaging team at Shopify. He loves baking and will always pick something he’s never tried at a restaurant. Follow Eric on Twitter.

        Diego Castañeda is a senior data scientist on the Core Optimize Data team at Shopify. Previously, he was part of the Messaging team and helped create machine learning powered features for Inbox. He loves computers, astronomy and soccer. Connect with Diego on LinkedIn.

        Wherever you are, your next journey starts here! If building systems from the ground up to solve real-world problems interests you, our Engineering blog has stories about other challenges we have encountered. Intrigued? Visit our Data Science & Engineering career page to find out about our open positions. Learn about how we’re hiring to design the future together—a future that is Digital by Design.

        Spin Infrastructure Adventures: Containers, Systemd, and CGroups

        The Spin infrastructure team works hard at improving the stability of the system. In February 2022 we moved to Container Optimized OS (COS), the Google maintained operating system for their Kubernetes Engine SaaS offering. A month later we turned on multi-cluster to allow for increased scalability as more users came on board. Recently, we’ve dramatically increased the default resources allotted to instances. However, with all these changes we’re still experiencing some issues, and I wanted to dive a bit deeper into one of them to share with you.

        Spin’s Basic Building Blocks

        First it's important to know the basic building blocks of Spin and how these systems interact. The Spin infrastructure is built on top of Kubernetes, using many of the same components that Shopify’s production applications use. Spin instances themselves are implemented via a custom resource controller that we install on the system during creation. Among other things, the controller transforms the Instance custom resource into a pod that’s booted from a special Isospin container image along with the configuration supplied during instance creation. Inside the container we utilize systemd as a process manager and workflow engine to initialize the environment, including installing dotfiles, pulling application source code, and running through bootstrap scripts. Systemd is vital because it enables a structured way to manage system initialization and this is used heavily by Spin.

        There’s definitely more to Spin than what I’ve described, but at a high level, and for the purposes of understanding the technical challenges ahead, it’s important to remember that:

        1. Spin is built on Kubernetes.
        2. Instances are run in a container.
        3. systemd is run INSIDE the container to manage the environment.

        First Encounter

        In February 2022, we had a serious problem with Pod relocations that we eventually tracked to node instability. We had several nodes in our Kubernetes clusters that would randomly fail and require either a reboot or to be replaced entirely. Google had decent automation for this that would catch nodes in a bad state and replace them automatically, but it was occurring often enough (five nodes per day, or about one percent of all nodes) that users began to notice. Through various discussions with Shopify’s engineering infrastructure support team and Google Cloud support, we eventually homed in on memory consumption as the primary issue. Specifically, nodes were running out of memory, and pods were being out of memory (OOM) killed as a result. At first, this didn’t seem so suspicious: we let users do whatever they want inside their containers and didn’t give them many resources (8 to 12 GB of RAM each), so it was natural to assume that containers were, rightfully, just using too many resources. However, we found some extra information that made us think otherwise.

        First, a container being OOM killed would occasionally be the only Spin instance on its node, and when we looked at its memory usage, it was often below the memory limit allotted to it.

        Second, in parallel to this, another engineer investigating a Kafka performance issue identified a healthy running instance using far more resources than should have been possible.

        The first issue would eventually be connected to a memory leak that the host node was experiencing, and through some trial and error we found that switching the host OS from Ubuntu to Container Optimized OS from Google solved it. The second issue remained a mystery. With the rollout of COS, though, we saw a 100-fold reduction in OOM kills, which was sufficient for our goals, and we began to direct our attention to other priorities.

        Second Encounter

        Fast forward a few months to May 2022. We were experiencing better stability, which was a source of relief for the Spin team. Our ATC rotations were significantly less frantic, and the infrastructure team had the chance to roll out important improvements, including multi-cluster support and a whole new snapshotting process. Overall things felt much better.

        Slowly but surely over the course of a few weeks, we started to see increased reports of instance instability. We verified that the nodes weren’t leaking memory as before, so it wasn’t a regression. This is when several team members re-discovered the excess memory usage issue we’d seen before, but this time we decided to dive a little further.

        We needed a clean environment to do the analysis, so we set up a new Spin instance on its own node. During our test, we monitored the Pod’s resource usage and the resource usage of the node it was running on, using kubectl top pod and kubectl top node. Before we performed any tests, we saw:

        Next, we needed to simulate memory load inside of the container. We opted to use a tool called stress, which lets us start a process that consumes a specified amount of memory to exercise the system.

        We ran kubectl exec -it spin-muhc -- bash to land inside a shell in the container, and then stress -m 1 --vm-bytes 10G --vm-hang 0 to start the test.

        Checking the resource usage again, we saw:

        This was great, exactly what we expected. The 10GB used by our stress test showed up in our metrics. Also, when we checked the cgroup assigned to the process we saw it was correctly assigned to the Kubernetes Pod:

        Here, 24899 was the PID of the process started by stress. This looked great as well. Next, we performed the same test, but in the instance environment accessed via spin shell. Checking the resource usage, we saw:

        Now this was odd. The memory created by stress wasn’t showing up under the Pod stats (still only 14Mi), but it was showing up for the node (33504Mi). Checking the usage from inside the container, we saw that it was indeed holding onto the memory as expected:

        However, when we checked the cgroup this time, we saw something new:

        What the heck!? Why was the cgroup different? We double checked that this was the correct hierarchy by using the systemd cgroup list tool from within the spin instance: 
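
        The same check can be scripted by parsing /proc/<pid>/cgroup directly. Here’s an illustrative parser; the sample line is fabricated to resemble a process correctly placed in a Kubernetes pod hierarchy, not actual Spin output.

```python
def parse_cgroup(text):
    """Parse /proc/<pid>/cgroup lines of the form 'hierarchy-id:controllers:path'."""
    mapping = {}
    for line in text.strip().splitlines():
        hier_id, controllers, path = line.split(":", 2)
        for ctrl in controllers.split(","):
            mapping[ctrl or "unified"] = path   # cgroup v2 entries have an empty controller field
    return mapping

# Fabricated sample resembling a process inside a Kubernetes pod cgroup.
sample = "4:memory:/kubepods/burstable/pod1234/abcd\n0::/kubepods/burstable/pod1234/abcd"
print(parse_cgroup(sample)["memory"])   # -> /kubepods/burstable/pod1234/abcd
```

        Reading /proc/self/cgroup with this parser from inside a container is a quick way to see which hierarchy (and therefore which limits) actually applies to a process.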

        So to summarize what we had seen: 

        1. When we run processes inside the container via kubectl exec, they’re correctly placed within the kubepods cgroup hierarchy. This is the hierarchy that contains the pods memory limits.
        2. When we run the same processes inside the container via spin shell, they’re placed within a cgroup hierarchy that doesn’t contain the limits. We verify this by checking the cgroup file directly:

        The value above is close to the maximum value of a 64-bit integer (about 8.5 billion gigabytes of memory). Needless to say, our system has less than that, so this is effectively unlimited.
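
        To make that arithmetic concrete: on a typical x86-64 host (a 4 KiB page size is an assumption here), cgroup v1 reports “no limit” as the largest signed 64-bit value rounded down to the page size.

```python
# cgroup v1 "no limit" sentinel: max signed 64-bit value, rounded down to the page size.
PAGE_SIZE = 4096          # assumed page size on a typical x86-64 host
NO_LIMIT = (2**63 - 1) // PAGE_SIZE * PAGE_SIZE

print(NO_LIMIT)                            # -> 9223372036854771712
print(round(NO_LIMIT / 2**30 / 1e9, 2))    # -> 8.59  (billions of GiB)
```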

        For practical purposes, this means any resource limitation we put on the Pod that runs Spin instances isn’t being honored. Spin instances can therefore use more memory than they’re allotted, which is concerning for a few reasons, but most importantly because we depend on those limits to keep instances from interfering with one another.

        Isolating It

        In a complex environment like Spin, it’s hard to account for everything that might be affecting the system. Sometimes it’s best to distill problems down to the essential details to properly isolate the issue. We were able to reproduce the cgroup leak in a few different ways: first, on real Spin instances directly, using crictl or ctr with custom arguments; second, in a local Docker environment. Setting up an experiment like this also allowed for much quicker iteration when testing potential fixes.

        From the experiments, we discovered differences in how the runtimes (containerd, Docker, and Podman) execute systemd containers. Podman, for instance, has a --systemd flag that enables and disables an integration with the host systemd. containerd has a similar flag, --runc-systemd-cgroup, that starts runc with the systemd cgroup manager. For Docker, however, no such integration exists (you can modify the cgroup manager via daemon.json, but not via the CLI like Podman and containerd), and we saw the same cgroup leakage. When comparing the cgroups assigned to the container processes between Docker and Podman, we saw the following:




        Podman placed the systemd and stress processes in a cgroup unique to the container. This allowed Podman to properly delegate the resource limitations to both systemd and any process that systemd spawns. This was the behavior we were looking for!

        The Fix

        We now had an example of a systemd container being properly isolated from the host with Podman. The trouble was that our Spin production environments use Kubernetes, which uses containerd, not Podman, as the container runtime. So how could we leverage what we learned from Podman toward a solution?

        While investigating differences between Podman and Docker with respect to systemd, we came across the crux of the fix. By default, Docker and containerd use a cgroup driver called cgroupfs to manage the allocation of resources, while Podman uses the systemd driver (this is specific to our host operating system, COS from Google). The systemd driver delegates responsibility for cgroup management to the host systemd, which then properly manages the delegate systemd that’s running in the container.
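
        For reference, the containerd-side setting involved is the runc SystemdCgroup option in the CRI plugin configuration. Sketched below as it appears in a typical /etc/containerd/config.toml; the exact plugin path can vary by containerd version.

```toml
# Tell containerd's runc runtime to delegate cgroup management to systemd.
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
  SystemdCgroup = true
```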

        It’s recommended that nodes running systemd on the host use the systemd cgroup driver; however, COS from Google is still set to use cgroupfs. Checking the developer release notes, we see that version 101 of COS mentions switching the default cgroup driver to systemd, so the fix is coming!

        What’s Next

        Debugging this issue was an enlightening experience. If you had asked us before, “Is it possible for a container to use more resources than it’s assigned?”, we would have said no. But now that we understand more about how containers deliver the sandbox they provide, it’s become clear the answer should have been, “It depends.”

        Ultimately, the cgroup escape came from us bind-mounting /sys/fs/cgroup read-only into the container. A subtle side effect of this is that while that directory isn’t writable, all of its subdirectories are. But since this mount is required for systemd to even boot up, we don’t have the option to remove it. There’s a lot of ongoing work by the container community to get systemd to exist peacefully within containers, but for now we’ll have to make do.


        Special thanks to Daniel Walsh from Red Hat for writing so much on the topic, and to Josh Heinrichs from the Spin team for investigating the issue and discovering the fix.

        Additional Information

        Chris is an infrastructure engineer with a focus on developer platforms. He’s also a member of the ServiceMeshCon program committee and a @Linkerd ambassador.

        Wherever you are, your next journey starts here! If building systems from the ground up to solve real-world problems interests you, our Engineering blog has stories about other challenges we have encountered. Intrigued? Visit our Engineering career page to find out about our open positions and learn about Digital by Design.

        Shopify and Open Source: A Mutually Beneficial Relationship

        Shopify and Rails have grown up together. Both were in their infancy in 2004, and our CEO (Tobi) was one of the first contributors and a member of Rails Core. Shopify was built on top of Rails, and our engineering culture is rooted in the Rails Doctrine, from developer happiness to the omakase menu, sharp knives, and majestic monoliths. We embody the doctrine pillars. 

        Shopify's success is due, in part, to Ruby and Rails. We feel obligated to pay that success back to the community as best we can. But our commitment and investment are about more than just paying off that debt; we have a more meaningful and mutually beneficial goal.

        One Hundred Year Mission

        At Shopify, we often talk about aspiring to be a 100-year company: to still be around in 2122! That feels like an ambitious dream, but with that goal in mind, we make decisions and build our code so that it scales as we grow. If we pull that off, will Ruby and Rails still be our tech stack? It's hard to answer, but it's part of my job to think about that tech stack over the next 100 years.

        Ruby and Rails as 100-year tools? What does that even mean?

        To get to 100 years, Rails has to be more than an easy way to get started on a new project. It's about cost-effective performance in production, well-formed opinions on the application architecture, easy upgrades, great editors, avoiding antipatterns, and choosing when you want the benefits of typing. 

        To get to 100 years, Ruby and Rails have to merit being the tool of choice, every day, for large teams and well-aged projects for a hundred years. They have to be the tool of choice for thousands of developers, across millions of lines of code, handling billions of web requests. That's the vision. That's Rails at scale.

        And that scale is where Shopify is investing.

        Why Companies Should Invest In Open Source

        Open source is the heart and soul of Rails: I’d say that Rails would be nowhere near what it is today if not for the open source community.

        Rafael França, Shopify Principal Engineer and Rails Core Team Member

        We invest in open source to build the most stable, resilient, and performant version of the tools we grow our applications on. How much better could it be if more people were contributing? As a community, we can do more. Ruby and Rails can only continue to be a choice for companies if we're actively investing in their development, and to do that, we need more companies involved in contributing.

        It Improves Engineering Skills

        Practice makes progress! Building open source software with cross-functional teams helps build better communication skills and offers opportunities to navigate feedback and criticism constructively. It also enables you to flex your debugging muscles and develop deep expertise in how the framework functions, which helps you build better, more stable applications for your company.

        It’s Essential to Application Health & Longevity

        Contributing to open source helps ensure that Rails benefits your application and the company in the long term. We contribute because we care about the changes and how they affect our applications. Investing upfront in the foundation is proactive, whereas rewrites and monkey patches are reactive and lead to brittle code that's hard to maintain and upgrade.

        At our scale, it's common to find issues with, or opportunities to enhance, the software we use. Why keep those improvements private? Because we build on open source software, it makes sense to contribute to those projects to ensure that they will be as great as possible for as long as possible. If we contribute to the community, it increases our influence on the software that our success is built on and helps improve our chances of becoming a 100-year company. This is why we make contributions to Ruby and Rails, and other open source projects. The commitment and investment are significant, but so are the benefits.

        How We're Investing in Ruby and Rails

        Shopify is built on a foundation of open source software, and we want to ensure that that foundation continues to thrive for years to come and that it continues to scale to meet our requirements. That foundation can’t succeed without investment and contribution from developers and companies. We don’t believe that open source development is “someone else’s problem”. We are committed to Ruby and Rails projects because the investment helps us future-proof our foundation and, therefore, Shopify. 

        We contribute to strategic projects and invest in initiatives that impact developer experience, performance, and security—not just for Shopify but for the greater community. Here are some projects we’re investing in:

        Improving Developer Tooling 

        • We’ve open-sourced projects like toxiproxy, bootsnap, packwerk, tapioca, paquito, and maintenance_tasks: niche tools we found we needed. If we need them, other developers likely need them as well.
        • We helped add Rails support to Sorbet's gradual typing to make typing better for everyone.
        • We're working to make Ruby support in VS Code best-in-class with pre-configured extensions and powerful features like refactorings.
        • We're working on automating upgrades between Ruby and Rails versions to reduce friction for developers.

        Increasing Performance

        Enhancing Security

        • We're actively contributing to bundler and rubygems to make Ruby's supply chain best-in-class.
        • We're partnering with Ruby Central to ensure long-term success and security through strategic investments in engineering, security-related projects, critical tools and libraries, and improving the cycle time for contributors.

        Meet Shopify Contributors

        The biggest investment you can make is to be directly involved in the future of the tools that your company relies on. We believe we are all responsible for the sustainability and quality of open source. Shopify engineers are encouraged to contribute to open source projects where possible. The commitment varies. Some engineers make occasional contributions, some are part-time maintainers of important open source libraries that we depend on, and some are full-time contributors to critical open source projects.

        Meet some of the Shopify engineers contributing to open source. Some of those faces are probably familiar because we have some well-known experts on the team. But some you might not know…yet. We're growing the next generation of Ruby and Rails experts to build for the future.

        Mike is a NYC-based engineering leader who's worked in a variety of domains, including energy management systems, bond pricing, high-performance computing, agile consulting, and cloud computing platforms. He is an active member of the Ruby open-source community, where as a maintainer of a few popular libraries he occasionally still gets to write software. Mike has spent the past decade growing inclusive engineering organizations and delivering amazing software products for Pivotal, VMware, and Shopify.

        The Story Behind Shopify’s Isospin Tooling

        You may have read that Shopify has built an in-house cloud development platform named Spin. In that post, we covered the history of the platform and how it powers our everyday work. In this post, we’ll take a deeper dive into one specific aspect of Spin: Isospin, Shopify’s systemd-based tooling that forms the core of how we run applications within Spin.

        The initial implementation of Spin used the time-honored POSS (Pile of Shell Scripts) design pattern. As we moved to a model where all of our applications ran in a single Linux VM, we quickly outgrew our tooling, not to mention the added complexity of managing multiple applications within a single machine. Decisions such as which dependency services to run, in what part of the boot process, and how many copies to run became much more difficult as we ran many applications together within the same instance. Specifically, we needed a way to:

        • split up an application into its component parts
        • specify the dependencies between those parts
        • have those jobs be scheduled at the appropriate times
        • isolate services and processes from each other.

        At a certain point, stepping back, an obvious answer began to emerge. The needs we were describing weren’t merely solvable; they were already solved, by something we were already using. We were describing services, the same as any other services run by the OS. There were already tools to solve this built right into the OS. Why not leverage them?

        A Lightning Tour of systemd

        systemd’s service management works by dividing the system into a graph of units representing individual services or jobs. Each unit can specify its dependencies on other units in granular detail, allowing systemd to determine an order in which to launch services to bring the system up, and to reason about cascading failure states. In addition to units representing actual services, it supports targets, which represent an abstract grouping of one or more units. Targets can have dependencies of their own and be depended on by other units, but perform no actual work. By specifying targets representing phases of the boot process and a top-level target representing the desired state of the system, systemd can quickly and comprehensively prepare services to run.
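As a minimal sketch (the unit names here are invented for illustration), a target that groups two services might look like:

```ini
# example-app.target (hypothetical): groups two services into one unit of intent.
# Targets perform no work themselves; they only express grouping and ordering.
[Unit]
Description=Example app is up
Wants=example-web.service example-worker.service
After=example-web.service example-worker.service
```

Starting the target pulls in both services (via Wants=), and the target is only considered reached once they have been started (via After=).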

        systemd has several features which enable dynamic generation of units. Since we were injecting multiple apps into a system at runtime, with varying dependencies and processes, we made heavy use of these features to enable us to create complex systemd service graphs on the fly.

        The first of these is template unit files. Ordinarily, systemd namespaces units via their names; any service named foo will satisfy a dependency on the service named foo, and only one instance of a unit with a name can be running at once. This was obviously not ideal for us, since we have many services that we’re running per-application. Template unit files expand this distinction a bit by allowing a service to take a parameter that becomes part of its namespace. For example, a service named foo@.service could take the argument bar, running as foo@bar. This allows multiple copies of the same service to run simultaneously. The parameter is also available within the unit as a variable, allowing us to namespace runtime directories and other values with the same parameter.

        Template units were key to us since not only do they allow us to share service definitions for applications themselves, they allow us to run multiple copies of dependency services. In order to maintain full isolation between applications—and to simulate the separately-networked services they would be talking to in production—neighbor apps within a single Isospin VM don’t use the same installation of core services such as MySQL or Elasticsearch. Instead, we run one copy of these services for each app that needs it. Template units simplified this process greatly and via a single service definition, we simply run as many copies of each as we need.
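To make this concrete, here is a hedged sketch of such a template unit (the service name, paths, and command are invented for illustration). A file named mysql@.service could be instantiated as mysql@storefront.service, mysql@checkout.service, and so on, with %i expanding to the instance name inside the unit:

```ini
# mysql@.service (hypothetical): one MySQL instance per application.
[Unit]
Description=MySQL for app %i

[Service]
# %i namespaces the data and runtime directories so instances don't collide.
ExecStart=/usr/sbin/mysqld --datadir=/var/lib/mysql-%i
RuntimeDirectory=mysql-%i
```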

        We also made use of generators, a systemd feature that allows dynamically creating units at runtime. This was useful for us since the dynamic state of our system meant that a fixed service order wasn’t really feasible. There were two primary features of Isospin’s setup that complicated things:

        1. Which app or apps to run in the system isn’t fixed, but rather is assigned when we boot a system. Thus, via information assigned at the time the system is booted, we need to choose which top-level services to enable.

        2. While many of the Spin-specific services are run for every app, dependencies on other services are dynamic. Not every app requires MySQL or Elasticsearch or so on. We needed a way to specify these systemd-level dependencies dynamically.

        Generators provided a simple way to solve this. Early in the bootup process, we run a generator that creates a target named spin-app for each app to be run in the system. That target contains all of the top-level dependencies an app needs to run, and is then assigned as a dependency of the “system is running” target. Despite sounding complex, this requires no more than a 28-line bash script and a simple template file for the service. Likewise, we’re able to assign the appropriate dependency services as requirements of this spin-app target via another generator that runs later in the process.
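A generator in this spirit can be sketched in a few lines of shell. The unit names, config file, and target below are hypothetical, and a real generator receives its output directories as arguments from systemd rather than inventing its own:

```shell
#!/bin/sh
# Sketch of a generator that requests one spin-app@ instance per configured app
# by symlinking template instances into a target's .wants directory.
gen_app_units() {
  gendir="$1"   # generator output directory (systemd passes this in)
  conf="$2"     # file listing one app name per line
  mkdir -p "$gendir/multi-user.target.wants"
  while read -r app; do
    [ -n "$app" ] || continue
    # Each symlink asks systemd to start spin-app@<name>.service at boot.
    ln -s "/usr/lib/systemd/system/spin-app@.service" \
      "$gendir/multi-user.target.wants/spin-app@$app.service"
  done < "$conf"
}

# Demo against throwaway directories rather than the real /run/systemd paths.
tmp=$(mktemp -d)
printf 'storefront\ncheckout\n' > "$tmp/apps"
gen_app_units "$tmp/gen" "$tmp/apps"
ls "$tmp/gen/multi-user.target.wants"
```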

        Booting Up an Example

        To help understand how this works in action, let’s walk through an example of the Isospin boot process.

        We start by creating a target that we use to represent that Spin has finished booting. We’ll use this target later to determine whether or not the system has successfully finished starting up. We then run a generator named apps that checks the configuration to see which apps we’ve specified for the system. It then generates new dependencies on the spin-app@ target, requesting one instance per application and passing in the name of the application as its parameter.

        spin-app@ depends on several of the core services that represent a fully available Spin application, including several more generators. Via those dependencies, we run the spin-svcs@ generator to determine which system-level service dependencies to inject, such as MySQL or Elasticsearch. We also run the spin-procs@ generator, which determines the command or commands that run the application itself and generates one service per command.

        Finally, we bring the app up via the spin-init@ service and its dependencies. spin-init@ represents the final state of bootstrapping necessary for the application to be ready to run, and via its recursive dependencies systemd builds out the chain of processes necessary to clone an application’s source, run bootstrap processes, and then run any necessary finalizing tasks before it’s ready to run.

        Additional Tools (and Problems)

        Although the previously described tooling got us very far, we found that we had a few additional problems that required some additional tooling to fix.

        A problem we encountered under this new model was port collision between services. In the past, our apps could assume they were the only app present, so they could claim a common port for themselves without conflict. Although systemd gave us a lot of process isolation for free, this was a hole we’d dug for ourselves and one we’d need to get out of by ourselves too. 

        The solution we settled on was simple but effective, and it leveraged a few systemd features to simplify the process. We reasoned that port collision is only a problem because port selection was in the user’s hands; we could solve it by making port assignment the OS’s responsibility. We created a service that handles port assignment programmatically via hashing: by taking the service’s name into account, we produce a semi-stable automated port assignment that avoids collision with any other ports we’ve assigned on the system. The service writes the generated port to an environment file, which systemd can use to inject environment variables into another service. As long as a service that needs to bind to a port declares this dependency, it receives a PORT variable that it’s meant to respect and bind to.
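The post doesn't show the hashing scheme itself, but the idea can be sketched like this. The base port, range, and choice of hash are assumptions for illustration, not Shopify's actual implementation:

```shell
# Sketch: derive a semi-stable port from a service name by hashing the name
# into a fixed range, so the same name always maps to the same port.
assign_port() {
  name="$1"
  base=20000    # start of the port range we hand out (assumed)
  range=10000   # number of ports available (assumed)
  hash=$(printf '%s' "$name" | cksum | cut -d' ' -f1)
  echo $(( base + hash % range ))
}

# The result could be written to an environment file for systemd to inject
# into the dependent service as PORT.
echo "PORT=$(assign_port 'myapp-web')"
```

The assignment is only semi-stable: it survives reboots because it depends solely on the name, but two names can still hash to the same port, which a real implementation would need to detect and resolve.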

        Another feature that came in handy is systemd’s concept of service readiness. Many process runners, including the Foreman-based solutions we’d been using in the past, have a binary concept of service readiness (either a process is running, or it isn’t), and if a process exits unexpectedly, it’s considered failed.

        systemd has the same model by default, but it also supports something more sophisticated: it allows configuring a notify socket through which an application explicitly communicates its readiness. systemd exposes a Unix datagram socket to the service it’s running via the NOTIFY_SOCKET environment variable. When the underlying app has finished starting up and is ready, it communicates that status by writing a message to the socket. This granularity helps avoid some of the rare but annoying gotchas of the simpler model of service readiness. It ensures that the service is only considered ready to accept connections when it’s actually ready, avoiding a scenario in which external services try sending messages during the startup window. It also avoids a situation where the process remains running but the underlying service has failed during startup.

        Some of the external services we depend on, such as MySQL, use this, but we also wrote our own tooling to incorporate it. Our notify-port script is a thin wrapper around web applications that monitors whether the service we’re wrapping has begun accepting HTTP connections on the port Isospin has assigned to it. By polling the port and notifying systemd when it comes up, we’ve been able to catch many real-world bugs where services were waiting on the wrong port, and situations in which a server failed on startup while leaving the process alive.
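Wiring a wrapper like this into a unit uses systemd's Type=notify service type. A hedged sketch (the binary paths are invented; notify-port's actual invocation isn't shown in the post) might look like:

```ini
# Hypothetical unit using the readiness protocol described above. systemd sets
# NOTIFY_SOCKET for the wrapper, which writes READY=1 once the port answers.
[Service]
Type=notify
ExecStart=/usr/local/bin/notify-port /usr/local/bin/myapp-server
```

With Type=notify, systemd holds back any units ordered after this one until the READY=1 message arrives, rather than proceeding as soon as the process has forked.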

        Isospin on Top

        Although we started out with some relatively simple goals, the more we worked with systemd, the more we found ourselves able to leverage its tools to our advantage. By building Isospin on top of systemd, we saved time by reusing pre-existing structures that suited our needs, and we took advantage of sophisticated tooling for expressing service interdependency and service health. 

        Going forward, we plan to continue expanding Isospin to express more complex service relationships. For example, we’re investigating the use of systemd service dependencies to allow teams to express that certain parts of their application rely on another team’s application being available.

        Misty De Méo is an engineer who specializes in developer tooling. She’s been a maintainer of the Homebrew package manager since 2011.

        How to Build Trust as a New Manager in a Fully Remote Team

        I had been in the same engineering org for seven years before the pandemic hit. We were a highly collaborative, co-located company, and the team was very used to brainstorming and working on physical whiteboards and workspaces. When the pandemic hit, we moved pretty seamlessly from working together in our office to working from our homes. We didn’t pay too much attention to designing a Digital First culture, and neither did we alter our ways of working dramatically. 

        It was only when I joined Shopify last September that I began to realize that working remotely in a company where you have already established trust is very different from starting in a fully remote space and building trust from the ground up. 

        What Is Different About Starting Remotely?

        So, what changes? The one word that comes to mind is intentionality, which I would define as “the act of thoughtfully designing interactions or processes.” A lot of things that happen seamlessly and organically in a real-life setting take more intentionality in a remote setting. If you deconstruct the process of building trust, you’ll find that in a physical setting trust is built in active ways (the words you speak, your actions, and your expertise), but also in passive ways (your body language, demeanor, and casual water-cooler talk). In a remote setting, it’s much more difficult to observe others, and also to build casual, non-transactional relationships with people, unless you’re intentional about it. 

        Also, since you’re represented a lot more through your active voice, it’s important to set up a new way of working and gain mastery over the set of tools and skills that will help you build trust and create success in your new environment.

        The 90-Day Plan

        The 90-Day Plan is named after the famous book The First 90 Days written by Michael D. Watkins. Essentially, it breaks down your onboarding journey into three steps:

        1. First 30 days: focus on your environment
        2. First 60 days: focus on your team
        3. 90 days and beyond: focus on yourself.

        First 30 Days: Focus on Your Environment

        Take the time out to think about what kind of workplace you want to create and also to reach out and understand the tone of the wider organization that you are part of.

        Study the Building 

        When you start a new job in a physical location, it’s common to study the office and understand the layout of not only the building but also the company itself. When beginning work remotely, I suggest you start with a metaphorical study of the building. Try to understand the wider context of the organization and the people in it. You can do this with a mix of pairing sessions, one-on-ones, and peer group sessions. These processes help you gain technical and organizational context and also build relationships with peers. 

        Set Up the Right Tools

        In an office, many details of workplace setup are abstracted away from you. In a fully digital environment, you need to pay attention to setting your workplace up for success. There are plenty of materials available on how to set up your home office. Ensure that you take the time to set up your remote tools to your taste. 

        Build Relationships 

        If you’re remote, it’s easy to be transactional with people outside of your immediate organization. However, it’s much more fun and rewarding to take the time to build relationships with people from different backgrounds across the company. It gives you a wider context of what the company is doing and the different challenges and opportunities.

        First 60 Days: Focus on Your Team

        Use asynchronous communication for productivity and synchronous for connection.

        Establish Connection and Trust

        When you start leading a remote team, the first thing to do is establish connection and trust. You do this by spending a lot of your time in the initial days meeting your team members and understanding their aspirations and expectations. You should also, if possible, try to meet the team once in real life within a few months of starting. 

        Meet in Real Life

        Meeting in real life will help you form deep human relationships with your team members and understand them beyond the limited scope of workplace transactions. Once you’ve done this, ensure that you create a mix of synchronous and asynchronous processes within your team. Examples of asynchronous processes are automated dailies, code reviews, receiving feedback, and collaboration on technical design documents. We use synchronous meetings for team retros, coffee sessions, demos, and planning sessions. Try to maximize async productivity and be intentional about the times the team does come together.

        Establishing Psychological Safety

        The important thing about leading a team remotely is to firmly establish a strong culture of psychological safety. Psychological safety in the workplace is important not only for teams to feel engaged, but also for members to thrive. While it might be trickier to establish psychological safety remotely, it’s definitely possible. Some ways to do it:

        1. Default to open communication wherever possible.
        2. Encourage people to speak about issues openly during retros and all-hands meetings.
        3. Be transparent about things that have not worked well for you. Setting this example will help people open themselves up to be vulnerable with their teams.

        90 Days and Beyond: Focus on Yourself

        How do you manage and moderate your own emotions as you find your way in a new organization with this new way of working?

        FOMO Is Real

        Starting in a new place is nerve-wracking. Starting while fully remote can be a lonely exercise. Working in a global company like Shopify means that you need to get used to the fact that work is always happening in some part of the globe. It’s easy to get overwhelmed and be “always on.” While FOMO can be very real, be mindful of all the new information you’re ingesting and take the time to reflect upon it.

        Design Your Workday

        Remote work means you’re not chained to a nine-to-five routine anymore. Reflect on the possibilities this offers and think about how you want to design your workday. Maybe you want meeting-free times to walk the dog, hit the gym, or take a power nap. Ensure you structure the day in a way that suits your life and plan your agenda accordingly.

        Try New Things

        The first few months are pretty intense as you try out ways to establish trust and build a strong team together. Not everything you try will stick, and not everything will work. The important thing is to be clear about what you’re setting out to achieve, collect feedback on what works and what doesn’t, learn from the experience, and move forward.

        Being able to work in a remote work environment is both rewarding and fun. It’s definitely a new superpower that, if used well, leads to rich and absorbing experiences. The first 90 days are just the beginning of this journey. Sit back, tighten your seatbelt and get ready for a joyride of learning and growth.

        Sadhana is an engineer, book nerd, constant learner, and enabler of people. She has worked in the industry for more than 20 years in various roles and tech stacks, and is an agile enthusiast. You can connect with Sadhana on LinkedIn and Medium.

        Introducing ShopifyQL: Our New Commerce Data Querying Language 

        At Shopify, we recognize the positive impact data-informed decisions have on the growth of a business. But we also recognize that data exploration is out of reach for those without a data science or coding background. To make it easier for our merchants to inform their decisions with data, we built an accessible, commerce-focused querying language. We call it ShopifyQL. ShopifyQL enables Shopify Plus merchants to explore their data with powerful features like easy-to-learn syntax, one-step data visualization, built-in period comparisons, and commerce-specific date functions. 

        I’ll discuss how ShopifyQL makes data exploration more accessible, then dive into the commerce-specific features we built into the language, and walk you through some query examples.

        Why We Built ShopifyQL

        As data scientists, engineers, and developers, we know that data is a key factor in business decisions across all industries. This is especially true for businesses that have achieved product market fit, where optimization decisions are more frequent. Now, commerce is a broad industry and the application of data is deeply personal to the context of an individual business, which is why we know it’s important that our merchants be able to explore their data in an accessible way.

        Standard dashboards offer a good solution for monitoring key metrics, while interactive reports with drill-down options allow deeper dives into understanding how those key metrics move. But reports and dashboards help merchants understand what happened, not why it happened. Often, merchants require custom data exploration to understand the why of a problem, or to investigate how different parts of the business were impacted by a set of decisions. For this, they turn to their data teams (if they have them) and the underlying data.

        Historically, our Shopify Plus merchants with data teams have employed a centralized approach in which data teams support multiple teams across the business. This strategy helps them maximize their data capability, but it puts data stakeholders across the business in constant competition for their data needs. Financial deep dives are prioritized over operational decision support, leaving marketing, merchandising, fulfillment, inventory, and operations to fend for themselves. They’re then forced to either make decisions with the standard reports and dashboards available to them, or do their own custom data exploration (often in spreadsheets). Most often they end up in the worst-case scenario: relying on their gut and leaving data out of the decision-making process.

        Going past the reports and dashboards into the underlying datasets that drive them is guarded by complex data engineering concepts and languages like SQL. The basics of traditional data querying languages are easy to learn, but applying them to datasets requires experience with, and knowledge of, the entire data lifecycle (from data capture to data modeling). In some cases, simple commerce-specific data explorations like year-over-year sales require a more complicated query than the basic pattern of selecting data from some table with some filter. This isn’t a core competency of our average merchant, so they get shut out of the data exploration process and lose the ability to inform their decisions with insights gleaned from custom data explorations. That’s why we built ShopifyQL.

        A Data Querying Language Built for Commerce

        We understand that merchants know their business the best and want to put the power of their data into their hands. Data-informed decision making is at the heart of every successful business, and with ShopifyQL we’re empowering Shopify Plus merchants to gain insights at every level of data analysis. 

        With our new data querying language, ShopifyQL, Shopify Plus merchants can easily query their online store data. ShopifyQL makes commerce data exploration accessible to non-technical users by simplifying traditional aspects of data querying like:

        • Building visualizations directly from the query, without having to manipulate data with additional tools.
        • Creating year-over-year analysis with one simple statement, instead of writing complicated SQL joins.
        • Referencing known commerce date ranges (for example, Black Friday), without having to remember the exact dates.
        • Accessing data specifically modeled for commerce exploration purposes, without having to connect the dots across different data sources. 

        Intuitive Syntax That Makes Data Exploration Easy

        The ShopifyQL syntax is designed to simplify the traditional complexities of data querying languages like SQL. The general syntax tree follows a familiar querying structure:

        FROM {table_name}
        SHOW|VISUALIZE {column1, column2,...} 
        TYPE {visualization_type}
        AS {alias1,alias2,...}
        BY {dimension|date}
        WHERE {condition}
        SINCE {date_offset}
        UNTIL {date_offset}
        ORDER BY {column} ASC|DESC
        COMPARE TO {date_offset}
        LIMIT {number}

        We kept some of the fundamentals of the traditional querying concepts because we believe these are the bedrock of any querying language:

        • FROM: choose the data table you want to query
        • SELECT: we changed the wording to SHOW because we believe that data needs to be seen to be understood. The behavior of the function remains the same: choose the fields you want to include in your query
        • GROUP BY: shortened to BY. Choose how you want to aggregate your metrics
        • WHERE: filter the query results
        • ORDER BY: customize the sorting of the query results
        • LIMIT: specify the number of rows returned by the query.
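
Putting these keywords together, a complete ShopifyQL query might look like the following sketch (the table and column names are illustrative, not guaranteed to match your store’s schema):

```
FROM sales
SHOW total_sales, gross_sales
BY product_category
WHERE region = 'Canada'
ORDER BY total_sales DESC
LIMIT 10
```

Read top to bottom: pick the dataset, pick the metrics, aggregate, filter, sort, and cap the result.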

        On top of these foundations, we wanted to bring a commerce-centric view to querying data. Here’s what we are making available via Shopify today.

        1. Start with the context of the dataset before selecting dimensions or metrics

        We moved FROM to precede SHOW because it’s more intuitive for users to select the dataset they care about first, and then the fields. When you want to know a product’s conversion rate, it’s natural to think of the product first and the metric second, so we swapped the order of FROM and SHOW compared to traditional querying languages.

        2. Visualize the results directly from the query

        Charts are one of the most effective ways of exploring data, and VISUALIZE aims to simplify this process. Most query languages and querying interfaces return data in tabular format and place the burden of visualizing that data on the end user. This means multiple tools, manual steps, and copy-pasting. The VISUALIZE keyword allows Shopify Plus merchants to display their data in a chart or graph directly from a query. For example, if you’re looking to identify trends in multiple sales metrics for a particular product category:

        A screenshot showing the ShopifyQL code at the top of the screen and a line chart that uses VISUALIZE to chart monthly total and gross sales
        Using VISUALIZE to chart monthly total and gross sales

        We’ve made the querying process simpler by introducing smart defaults that let you get the same output with fewer lines of code. The query from above can also be written as:

        FROM sales
        VISUALIZE total_sales, gross_sales
        BY month
        WHERE product_category = 'Shoes'
        SINCE -13m

        The relationship between the query and the output remains explicit, but the user gets to the result much faster.

        The following language features are currently being worked on, and will be available later this year:

        3. Period comparisons are native to the ShopifyQL experience

        Whether it’s year-over-year, month-over-month, or a custom date range, period comparison analyses are a staple in commerce analytics. With traditional querying languages, you either have to model a dataset to contain these comparisons as their own entries, or write more complex queries involving window functions, common table expressions, or self joins. We’ve simplified that to a single statement. The COMPARE TO keyword allows ShopifyQL users to effortlessly perform period-over-period analysis. For instance, comparing this week’s sales data to last week’s:

        A screenshot showing the ShopifyQL code at the top of the screen and a line chart that uses VISUALIZE for comparing total sales between 2 time periods with COMPARE TO
        Comparing total sales between 2 time periods with COMPARE TO

        This powerful feature makes period-over-period exploration simpler and faster; no need to learn joins or window functions. Future development will enable multiple comparison periods for added functionality.
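
As a sketch of the statement in use (the date offsets here are illustrative; we’re assuming COMPARE TO takes an offset for the start of the comparison period), comparing the last seven days of sales against the seven days before could look like:

```
FROM sales
VISUALIZE total_sales
BY day
SINCE -7d
COMPARE TO -14d
```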

        4. Commerce specific date ranges simplify time period filtering

        Commerce-specific date ranges (for example, Black Friday Cyber Monday, the Christmas holidays, or Easter) normally involve a manual lookup or a join to some holiday dataset. With ShopifyQL, we take care of the manual aspects of filtering for these date ranges and let the user focus on the analysis.

        The DURING statement, in conjunction with Shopify provided date ranges, allows ShopifyQL users to filter their query results by commerce-specific date ranges. For example, finding out what the top five selling products were during BFCM in 2021 versus 2020:

        A screenshot showing the ShopifyQL code at the top of the screen and a table that shows Product Title, Total Sales BFCM 2021, and Total Sales BFCM 2019
        Using DURING to simplify querying BFCM date ranges

        Future development will allow users to save their own date ranges unique to their business, giving them even more flexibility when exploring data for specific time periods.
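
A sketch of the example above (the named date-range identifiers and column names are illustrative, not confirmed syntax):

```
FROM sales
SHOW product_title, total_sales
BY product_title
DURING bfcm2021
COMPARE TO bfcm2020
ORDER BY total_sales DESC
LIMIT 5
```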

        Check out the full list of current ShopifyQL features in our language docs.

        Data Models That Simplify Commerce-Specific Analysis and Explorations

        ShopifyQL allows us to access data models that address commerce-specific use cases and abstract the complexities of data transformation. Traditionally, businesses trade off SQL query simplicity for functionality, which limits users’ ability to perform deep dives and explorations. Since they can’t customize the functionality of SQL, their only lever is data modeling. For example, if you want to make data exploration more accessible to business users via simple SQL, you have to either create one flat table that aggregates across all data sources, or a number of use case specific tables. While this approach is useful in answering simple business questions, users looking to dig deeper would have to write more complex queries to either join across multiple tables, leverage window functions and common table expressions, or use the raw data and SQL to create their own models. 

        Alongside ShopifyQL we’re building exploration data models that are able to answer questions across the entire spectrum of commerce: products, orders, and customers. Each model focuses on the necessary dimensions and metrics to enable data exploration associated with that domain. For example, our product exploration dataset allows users to explore all aspects of product sales such as conversion, returns, inventory, etc. The following characteristics allow us to keep these data model designs simple while maximizing the functionality of ShopifyQL:

        • Single flat tables aggregated to the lowest domain dimension grain and time attribute. There is no need for complicated joins, common table expressions, or window functions. Each table contains the necessary metrics that describe that domain’s interaction across the entire business, regardless of where the data is coming from (for example, product pageviews and inventory are product concerns from different business processes).
        • All metrics are fully additive across all dimensions. Users are able to leverage the ShopifyQL aggregation functions without worrying about which dimensions are conformed. This also makes table schemas relatable to spreadsheets, and easy to understand for business users with no experience in data modeling practices.
        • Datasets support overlapping use cases. Users can calculate key metrics like total sales in multiple exploration datasets, whether the focus is on products, orders, or customers. This allows users to reconcile their work and gives them confidence in the queries they write.

        Without the leverage of creating our own querying language, the characteristics above would require complex queries which would limit data exploration and analysis.

        ShopifyQL Is a Foundational Piece of Our Platform

        We built ShopifyQL for our Shopify Plus merchants, third-party developer partners, and ourselves as a way to serve merchant-facing commerce analytics. 

        Merchants can access ShopifyQL via our new first-party app, ShopifyQL Notebooks

        We used the ShopifyQL APIs to build an app that allows our Shopify Plus merchants to write ShopifyQL queries inside a traditional notebooks experience. The notebooks app gives users the ultimate freedom of exploring their data, performing deep dives, and creating comprehensive data stories. 

        ShopifyQL APIs enable our partners to easily develop analytics apps

        The Shopify platform allows third-party developers to build apps that enable merchants to fully customize their Shopify experience. We’ve built GraphQL endpoints for access to ShopifyQL and the underlying datasets. Developers can leverage these APIs to submit ShopifyQL queries and return the resulting data in the API response. This allows our developer partners to save time and resources by querying modeled data. For more information about our GraphQL API, check out our API documentation.

        ShopifyQL will power all analytical experiences on the Shopify platform

        We believe ShopifyQL can address all commerce analytics use cases. Our internal teams are going to leverage ShopifyQL to power the analytical experiences we create in the Shopify Admin—the online backend where merchants manage their stores. This helps us standardize our merchant-facing analytics interfaces across the business. Since we’re also the users of the language, we’re acutely aware of its gaps, and can make changes more quickly.

        Looking ahead

        We’re planning new language features designed to make querying with ShopifyQL even simpler and more powerful:

        • More visualizations: Line and bar charts are great, but we want to provide more visualization options that help users discover different insights. New visualizations on the roadmap include dual axis charts, funnels, annotations, scatter plots, and donut charts.
        • Pivoting: Pivoting data with a traditional SQL query is a complicated endeavor. We will simplify this with the capability to break down a metric by dimensional attributes in a columnar fashion. This will allow for charting trends of dimensional attributes across time for specific metrics with one simple query.
        • Aggregate conditions: Akin to a HAVING statement in SQL, we are building the capability for users to filter their queries on an aggregate condition. Unlike SQL, we’re going to allow for this pattern in the WHERE clause, removing the need for additional language syntax and keyword ordering complexity.

        As we continue to evolve ShopifyQL, our focus will remain on making commerce analytics more accessible to those looking to inform their decisions with data. We’ll continue to empower our developer partners to build comprehensive analytics apps, enable our merchants to make the most out of their data, and support our internal teams with powering their merchant-facing analytical use cases.

        Ranko is a product manager working on ShopifyQL and data products at Shopify. He's passionate about making data informed decisions more accessible to merchants.

        Are you passionate about solving data problems and eager to learn more? We’re always hiring! Reach out to us or apply on our careers page.


        Making Open Source Safer for Everyone with Shopify’s Bug Bounty Program


        Zack Deveau, Senior Application Security Engineer at Shopify, shares the details behind a recent contribution to the Rails library, inspired by a bug bounty report we received. He'll go over the report and its root cause, how we fixed it in our system, and how we took it a step further to make Rails more secure by updating the default serializer for a few classes to use safe defaults.


        How We Built Shopify Party


        Shopify Party is a browser-based internal tool that we built to make our virtual hangouts more fun. With Shopify’s move to remote, we wanted to explore how to give people a break from video fatigue and create a new space designed for social interaction. Here's how we built it.


        8 Data Conferences Shopify Data Thinks You Should Attend


        Our mission at Shopify is to make commerce better for everyone. Doing this in the long term requires constant learning and development – which happen to be core values here at Shopify.

        Learning and development aren’t exclusively internal endeavors; they also depend on broadening your horizons and gaining insight from what others are doing in your field. One of our favorite formats for learning is conferences.

        Conferences are an excellent way to hear from peers about the latest applications, techniques, and use cases in data science. They’re also great for networking and getting involved in the larger data community.

        We asked our data scientists and engineers to curate a list of the top upcoming data conferences for 2022. Whether you’re looking for a virtual conference or in-person learning, we’ve got something that works for everyone.

        Hybrid Events

        Data + AI Summit 2022

        When: June 27-30
        Where: Virtual or in-person (San Francisco)

        The Data + AI Summit 2022 is a global event that provides access to some of the top experts in the data industry through keynotes, technical sessions, hands-on training, and networking. This four-day event explores various topics and technologies ranging from business intelligence, data analytics, and machine learning, to working with Presto, Looker, and Kedro. There’s great content on how to leverage open source technology (like Spark) to develop practical ways of dealing with large volumes of data. I definitely walked away with newfound knowledge.

        Ehsan K. Asl, Senior Data Engineer

        Transform 2022

        When: July 19-28
        Where: Virtual or in-person (San Francisco)

        Transform 2022 offers three concentrated events over an action-packed two weeks. The first event—The Data & AI Executive Summit—provides a leadership lens on real-world experiences and successes in applying data and AI. The following two events—The Data Week and The AI & Edge Week—dive deep into the most relevant topics in data science across industry tracks like retail, finance, and healthcare. Transform’s approach to showcasing various industry use cases is one of the reasons this is such a great conference. I find hearing how other industries are practically applying AI can help you find unique solutions to challenges in your own industry.

        Ella Hilal, VP of Data Science

        RecSys 2022

        When: September 18-23
        Where: Virtual or in-person (Seattle)

        RecSys is a conference dedicated to sharing the latest developments, techniques, and use cases in recommender systems. The conference has both a research track and an industry track, allowing for different types of talks and perspectives from the field. The industry track is particularly interesting, since you get to hear about real-world recommender use cases and challenges from leading companies. Expect talks and workshops centered around applications of recommender systems in various settings (fashion, ecommerce, media, etc.), reinforcement learning, evaluation and metrics, and bias and fairness.

        Chen Karako, Data Scientist Lead

        ODSC West

        When: November 1-3
        Where: Virtual or in-person (San Francisco)

        ODSC West is a great opportunity to connect with the larger data science community and contribute your ideas to the open source ecosystem. Attend to hear keynotes on topics like machine learning, MLOps, natural language processing, big data analytics, and new frontiers in research. On top of in-depth technical talks, there’s a pre-conference bootcamp on programming, mathematics or statistics, a career expo, and opportunities to connect with experts in the industry.

        Ryan Harter, Senior Staff Data Scientist

        In-person Events

        PyData London 2022

        When: June 17-19
        Where: London

        PyData is a community of data scientists and data engineers who use and develop a variety of open source data tools. The organization has a number of events around the world, but PyData London is one of its larger events. You can expect the first day to be tutorials providing walkthroughs on methods like data validation and training object detection with small datasets. The remaining two days are filled with talks on practical topics like solving real-world business problems with Bayesian modeling. Catch Shopify at this year’s PyData London event.

        Micayla Wood, Data Brand & Comms Senior Manager

        KDD 2022

        When: August 14-18
        Where: Washington D.C.

        KDD is a highly research-focused event and home to some of the data industry’s top innovations, like personalized advertising and recommender systems. Keynote speakers range from academics to industry professionals and policy leaders, and each talk is accompanied by well-written papers for easy reference. I find this is one of the best conferences for keeping in touch with industry trends, and I’ve actually applied what I learned from attending KDD to the work that I do.

        Vincent Chio, Data Science Manager

        Re-Work Deep Learning Summit

        When: November 9-10
        Where: Toronto

        The Re-Work Deep Learning Summit focuses on showcasing the learnings from the latest advancements in AI and how businesses are applying them in the real world. With content focusing on areas like computer vision, pattern recognition, generative models, and neural networks, it was great to hear speakers not only share successes, but also the challenges that still lie in the application of AI. While I’m personally not a machine learning engineer, I still found the content approachable and interesting, especially the talks on how to set up practical ETL pipelines for machine learning applications.

        Ehsan K. Asl, Senior Data Engineer

        Crunch
        When: October 3-5
        Where: Budapest

        Crunch is a data science conference focused on sharing the industry’s top use cases for using data to scale a business. With talks and workshops around areas like the latest data science trends and tools, how to build effective data teams, and machine learning at scale, Crunch has great content and opportunities to meet other data scientists from around Europe. With their partner conferences (Impact and Amuse) you also have a chance to listen in to interesting and relevant product and UX talks.

        Yizhar (Izzy) Toren, Senior Data Scientist

        Rebekah Morgan is a Copy Editor on Shopify's Data Brand & Comms team.

        Are you passionate about data discovery and eager to learn more? We’re always hiring! Visit our Data Science and Engineering career page to find out about our open positions. Join our remote team and work (almost) anywhere. Learn about how we’re hiring to design the future together—a future that is digital by design.


        Building a Form with Polaris


        As companies grow in size and complexity, there comes a moment in time when teams realize the need to standardize portions of their codebase. This most often becomes the impetus for creating a common set of processes, guidelines, and systems, as well as a standard set of reusable UI elements. Shopify’s Polaris design system was born from such a need. It includes design and content guidelines along with a rich set of React components for the UI that Shopify developers use to build the Shopify admin, and that third-party app developers can use to create apps on the Shopify App Store.

        Working with Polaris

        Shopify puts a great deal of effort into Polaris and it’s used quite extensively within the Shopify ecosystem, by both internal and external developers, to build UIs with a consistent look and feel. Polaris includes a set of React components used to implement functionality such as lists, tables, cards, avatars, and more.

        Form… Forms… Everywhere

        Polaris includes 16 separate components that not only encompass the form element itself, but also the standard set of form inputs such as checkbox, color or date picker, radio button, and text field to name just a few. Here are a few examples of basic form design in Shopify admin related to creating a Product or Collection.

        Add Product and Create Collection Forms

        Additional forms you’ll encounter in the Shopify admin also include creating an Order or Gift Card.

        Create Order and Issue Gift Card Forms

        From these form examples, we see a clear UI design implemented using reusable Polaris components.

        What We’re Building

        In this tutorial, we’ll build a basic form for adding a product that includes Title and Description fields, a Save button, and a top navigation that allows the user to return to the previous page.

        Add Product Form

        Although our initial objective is to create the basic form design, this tutorial is also meant as an introduction to React, so we’ll also add the additional logic such as state, events, and event handlers to emulate a fully functional React form. If you’re new to React then this tutorial is a great introduction into many of the fundamental underlying concepts of this library.

        Starter CodeSandbox

        As this tutorial is a step-by-step guide, providing all the code snippets along the way, we highly encourage you to fork this Starter CodeSandbox and code along with us. “Knowledge” is in the understanding, but “knowing” is only gained in application.

        If you hit any roadblocks along the way, or just prefer to jump right into the solution code, here’s the Solution CodeSandbox.

        Initial Code Examination

        The codebase we’re working with is a basic create-react-app. We’ll use the PolarisForm component to build our basic form using Polaris components and then add standard React state logic. The starter code also imports the @shopify/polaris library which can be seen in the dependencies section.

        Initial Setup in CodeSandbox
        Initial Setup

        One other thing to note about our component structure is that the component folders contain both an index.js file and a ComponentName.js file. This is a common pattern visible throughout the Polaris component library, such as the Avatar component in the example below.

        Polaris Avatar Component Folder

        Let’s first open the PolarisForm component. We can see that it’s a bare-bones component that outputs some initial text, just so we know everything is working as expected.

        Choosing Components

        Choosing our Polaris components will seem intuitive but, at times, also requires a deeper understanding of the Polaris library which the tutorial is meant to introduce along the way. Here’s a list of the components we’ll include in our design:


        • Form: the actual form
        • FormLayout: to apply a bit of styling between the fields
        • TextField: the text inputs for our form
        • Page: to provide the back arrow navigation and Save button
        • Card: to apply a bit of styling around the form

        Reviewing the Polaris Documentation

        When working with any new library, it’s always best to examine the documentation or as developers like to say, RTFM. With that in mind we’ll review the Polaris documentation along the way, but for now let’s start with the Form component.

        The short description of the Form component describes it as “a wrapper component that handles the submissions of forms.” Also, in order to make it easier to start working with Polaris, each component provides best practices, related components, and several use case examples along with a corresponding CodeSandbox. The docs provide explanations for all the additional props that can be passed to the component.

        Adding Our Form

        It’s time to dive in and build out the form based on our design. The first thing we need to do is import the components we’ll be working with into the PolarisForm.js file.

        import { Form, FormLayout, TextField, Page, Card } from "@shopify/polaris";

        Now let’s render the Form and TextField components. I’ve also gone ahead and included the following TextField props: label, type, and multiline.
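
A sketch of what that render could look like (the prop values here are our own; Polaris’s TextField also expects value and onChange props, which we’ll wire up in the state section later):

```jsx
function PolarisForm() {
  return (
    <Form>
      {/* multiline renders the Description input as a multi-row textarea */}
      <TextField label="Title" type="text" />
      <TextField label="Description" type="text" multiline={4} />
    </Form>
  );
}

export default PolarisForm;
```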

        So it seems our very first introduction into our Polaris journey is the following error message:

        No i18n was provided. Your application must be wrapped in an <AppProvider> component. See for implementation instructions.

        Although the message also provides a link to the AppProvider component, I’d suggest we take a few minutes to read the Get Started section of the documentation. We see there’s a working example of rendering a Button component that’s clearly wrapped in an AppProvider.

        And if we take a look at the docs for the AppProvider component it states it's “a required component that enables sharing global settings throughout the hierarchy of your application.”

        As we’ll see later, Polaris creates several layers of shared context which are used to manage distinct portions of the global state. One important feature of the Shopify admin is that it supports up to 20 languages. The AppProvider is responsible for sharing those translations to all child components across the app.

        We can move past this error by importing the AppProvider and replacing the existing React Fragment (<>) in our App component.
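
In code, that change might look like the following sketch (the translation import follows Polaris’s documented pattern, and the PolarisForm import path is assumed from the starter’s folder layout):

```jsx
import { AppProvider } from "@shopify/polaris";
import enTranslations from "@shopify/polaris/locales/en.json";
import PolarisForm from "./components/PolarisForm";

export default function App() {
  return (
    // AppProvider shares global settings, such as translations, with all children
    <AppProvider i18n={enTranslations}>
      <PolarisForm />
    </AppProvider>
  );
}
```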

        The MissingAppProviderError should now be a thing of the past and the form renders as follows:

        Rendering The Initial Form

        Examining HTML Elements

        One freedom developers allow themselves when developing code is to be curious. In this case, my curiosity drives me towards viewing what the HTML elements actually look like in developer tools once rendered.

        Some questions that come to mind are: “how are they being rendered” and “do they include any additional information that isn’t visible in the component.” With that in mind, let’s take a look at the HTML structure in developer tools and see for ourselves.

        Form Elements In Developer Tools

        At first glance it’s clear that Polaris is prefixed to all class and ID names. There are a few more elements to the HTML hierarchy as some of the elements are collapsed, but please feel free to pursue your own curiosity and continue to dig a bit deeper into the code.

        Working With React Developer Tools

        Another interesting place to look as we satisfy our curiosity is the Components tab in Dev Tools. If you don’t see it then take a minute to install the React Developer Tools Chrome Extension. I’ve highlighted the Form component so we can focus our attention there first and see all the component hierarchy of our form. Once again, we’ll see there’s more being rendered than what we imported and rendered into our component.

        Form Elements In Developer Tools

        We also can see that at the very top of the hierarchy, just below App, is the AppProvider component. Being able to see the entire component hierarchy provides some context into how many nested levels of context are being rendered. 

        Context is a much deeper topic in React and is meant to allow child components to consume data directly instead of prop drilling. 

        Form Layout Component

        Now that we’ve allowed ourselves the freedom to be curious, let’s refocus ourselves on implementing the form. One thing we might have noticed in the initial layout of the elements is that there’s no space between the Title input field and the Description label. This can be easily fixed by wrapping both TextField components in a FormLayout component. 

        If we take a look at the documentation, we see that the FormLayout component is “used to arrange fields within a form using standard spacing and that by default it stacks fields vertically but also supports horizontal groups of fields.”

        Since spacing is what we needed in our design, let’s include the component.
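
The change is a simple wrap (a sketch; the TextField props carry over from earlier):

```jsx
<Form>
  <FormLayout>
    <TextField label="Title" type="text" />
    <TextField label="Description" type="text" multiline={4} />
  </FormLayout>
</Form>
```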

        Once the UI updates, we see that it now includes the additional space needed to meet our design specs.

        Form With Spacing Added

        With our input fields in place, let’s now move onto adding the back arrow navigation and the Save button to the top of our form design.

        The Page Component

        This is where Polaris steps outside the bounds of being intuitive and requires that we do a little digging into the documentation, or better yet the HTML. Since we know that we’re rebuilding the Add Product form, then perhaps we take a moment to once again explore our curiosity and take a look at the actual Form component in Shopify admin in the Chrome Dev Tools.

        Polaris Page Component As Displayed In HTML

        If we highlight the back arrow in HTML it highlights several class names prefixed with Polaris-Page. It looks like we’ve found a reference to the component we need, so now it’s off to the documentation to see what we can find.

        Located under the Structure category in the documentation, there’s a component called Page. The short description for the Page component states that it’s “used to build the outer-wrapper of a page, including the page title and associated actions.” The assumption is that title is used for the Add Product text and the associated action includes the Save button.

        Let’s give the rest of the documentation a closer look to see what props implement that functionality. Taking a look at the working example, we can correlate it with the following props:


        • breadcrumbs: adds the back arrow navigation
        • title: adds the title to the right of the navigation
        • primaryAction: adds the Save button, which will include an event listener


        With props in hand, let’s add the Page component along with its props and set their values accordingly.
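
Here’s a sketch of the Page wrapper (the breadcrumb content and URL, and the handleSave callback, are placeholders of our own):

```jsx
<Page
  breadcrumbs={[{ content: "Products", url: "/products" }]}
  title="Add Product"
  primaryAction={{ content: "Save", onAction: handleSave }}
>
  {/* form markup goes here */}
</Page>
```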

        Polaris Page Component As Displayed In HTML
        Page Component With Props

        The Card Component

        We're getting so close to completing the UI design, but based on a side by side comparison, it still needs a bit of white space around the form elements.  Of course we could opt to add our own custom CSS, but Polaris provides us a component that achieves the same results.

        Comparing Our Designs

        If we take a look at the Shopify admin Add Products form in Dev Tools, we see a reference to the Card component, and it appears to be using padding (green outline in the image below) to create the space.

        Card Component Padding

        Let’s also take a look at the documentation to see what the Card component brings to the table. The short description for the Card component states that it is “used to group similar concepts and tasks together to make Shopify easier for merchants to scan, read and get things done.” Although it makes no mention of creating space via padding or margin, if we look at the example provided, we see that it contains a prop called sectioned, which the docs state is used to “auto wrap content in a section.” Feel free to toggle the True/False buttons to confirm this does indeed create the spacing we’re looking for.

        Let’s add the Card component and include the sectioned prop.
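In other words, something like this (a sketch; the TextField components come from the earlier steps):

```jsx
<Card sectioned>
  {/* the TextField components go inside */}
</Card>
```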

        It seems our form has finally taken shape and our UI design is complete.

        Final Design

        Add the Logic

        Although our form is nice to look at, it doesn’t do very much at this point. Typing into the fields doesn’t capture any input, and clicking the Save button does nothing. The only feature that appears to function is the back button.

        If you’re new to React, this section introduces some of its fundamental elements, such as state, events, and event handlers.

        In React, forms can be configured as controlled or uncontrolled. If you search for either concept, you’ll find many articles that describe the differences and use cases for each approach. In our example, we’ll configure the form as a controlled form, which means we’ll capture every keystroke and re-render the input in its current state.

        Adding State and Handler Functions

        Since we will be capturing every keystroke in two separate input fields, we’ll opt to instantiate two instances of state. The first thing we need to do is import the useState Hook.

        import { useState } from "react";

        Now we can instantiate two unique instances of state called title and description. Instantiating state requires that we create both a state value and a setState function. React is very particular about state and requires that any updates to the state value use the setState function.
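A sketch of those two state instances, assuming empty strings as initial values:

```jsx
const [title, setTitle] = useState("");
const [description, setDescription] = useState("");
```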

        With our state instantiated, let’s create the event handler functions that manage all updates to the state. Handler functions aren’t strictly required; however, they’re a React best practice: developers expect a handler function by convention, and additional logic often needs to run before state is updated.

        Since we have two state values to update, we’ll create a separate handler function for each one. And because we’re also implementing a form, we need a third handler to manage form submission.
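A sketch of the three handlers (the names are ours; Polaris TextField passes the new value straight to onChange):

```jsx
const handleTitleChange = (newValue) => setTitle(newValue);
const handleDescriptionChange = (newValue) => setDescription(newValue);

const handleSubmit = (event) => {
  // validation or API calls could run here before state is updated
  console.log(event);
};
```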

        Adding Events 

        We’re almost there. Now that state and our event handler functions are in place, it’s time to add the events and assign them the corresponding functions. The two events that we add are: onChange and onSubmit.

        Let’s start by adding the onChange event to both of the TextField components. Not only do we need to add the event, but because we’re implementing a controlled form, we also need to include the value prop and assign it the corresponding state value.
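For example (the labels are assumed from the design):

```jsx
<TextField label="Title" value={title} onChange={handleTitleChange} />
<TextField label="Description" value={description} onChange={handleDescriptionChange} />
```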

        Take a moment to confirm that the form is capturing input by typing into the fields. If it works, then we’re good to go.

        The last event we’ll add is the onSubmit event. Our logic dictates that the form would only be submitted once the user clicks on the Save button, so that’s where we’ll add the event logic.

        If we take a look at the documentation for the Page component, we see that it includes an onAction prop. Although the documentation doesn’t go any further than providing an example, we can assume that’s the prop we use to trigger the onSubmit function.

        Let’s confirm that everything is now tied together by clicking the Save button. If everything worked, we should see the following console log output:

        SyntheticBaseEvent {_reactName: "onClick", _targetInst: null, type: "click", nativeEvent: PointerEvent, target: HTMLSpanElement...}

        Clearing the Form

        The very last step in our form submission process is to clear the form fields so that the merchant has a clean slate if they choose to add another product.
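One way to do that is to reset both state values inside the submit handler, as in this sketch:

```jsx
const handleSubmit = () => {
  setTitle("");
  setDescription("");
};
```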

        This tutorial was meant to introduce you to the React components available in Shopify’s Polaris design system. The library provides a robust set of components that have been meticulously designed by our UX teams and implemented by our development teams. The Polaris GitHub repository is open source, so feel free to look around or set up the local development environment (which uses Storybook).

        Joe Keohan is an RnD Technical Facilitator responsible for onboarding our new hire engineering developers. He has been teaching and educating for the past 10 years and is passionate about all things tech. Feel free to reach out on LinkedIn and extend your network or to discuss engineering opportunities at Shopify! When he isn’t leading sessions, you’ll find Joe jogging, surfing, coding and spending quality time with his family.

        Wherever you are, your next journey starts here! If building systems from the ground up to solve real-world problems interests you, our Engineering blog has stories about other challenges we have encountered. Intrigued? Visit our Engineering career page to find out about our open positions and learn about Digital by Design.


        To Thread or Not to Thread: An In-Depth Look at Ruby’s Execution Models


        Deploying Ruby applications using threaded servers has become widely considered standard practice in recent years. According to the 2022 Ruby on Rails community survey, in which over 2,600 members of the global Rails community responded to a series of questions regarding their experience using Rails, threaded web servers such as Puma are by far the most popular deployment target. Similarly, when it comes to job processors, the thread-based Sidekiq seems to represent the majority of deployments.

        In this post, I’ll explore the mechanics and reasoning behind this practice and share knowledge and advice to help you make well-informed decisions on whether or not you should utilize threads in your applications (and to that point—how many). 

        Why Are Threads the Popular Default?

        While there are certainly many different factors for threaded servers' rise in popularity, their main selling point is that they increase an application’s throughput without increasing its memory usage too much. So to fully understand the trade-offs between threads and processes, it’s important to understand memory usage.

        Memory Usage of a Web Application

        Conceptually, the memory usage of a web application can be divided into two parts.

        Two separate text boxes stacked on top of each other, the top one containing the words "Static memory" and the bottom containing the words "Processing memory".
        Static memory and processing memory are the two key components of memory usage in a web application.

        The static memory is all the data you need to run your application. It consists of the Ruby VM itself, all the VM bytecode generated while loading the application, and probably some static Ruby objects such as I18n data. This part is a fixed cost: whether your server runs 1 or 10 threads, it stays stable and can be considered read-only.

        The request processing memory is the amount of memory needed to process a request. There you'll find database query results, the output of rendered templates, and so on. This memory is constantly being freed by the garbage collector and reused, and the amount needed is directly proportional to the number of threads your application runs.

        Based on this simplified model, we express the memory usage of a web application as:

        processes * (static_memory + (threads * processing_memory))

        So if you have only 512MiB available, with an application using 200MiB of static memory and needing 150MiB of processing memory per thread, using two single-threaded processes requires 700MiB of memory, while a single process with two threads will use only 500MiB and fit in a Heroku dyno.
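As a quick sanity check, the model and the example numbers can be expressed in a few lines of Ruby (the method and argument names are ours):

```ruby
# Simplified model: processes * (static_memory + threads * processing_memory)
def memory_usage(processes:, threads:, static:, processing:)
  processes * (static + threads * processing)
end

memory_usage(processes: 2, threads: 1, static: 200, processing: 150) # => 700 MiB
memory_usage(processes: 1, threads: 2, static: 200, processing: 150) # => 500 MiB
```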

        Two columns of text boxes next to each other. On the left, a column representing a single process with two threads shows a box with the text "Static Memory" at the top, and two boxes with the text "Thread #1 Processing Memory" and "Thread #2 Processing Memory" beneath it. In the column on the right, which represents two single-threaded processes, there are four boxes, which read: "Process #1 Static Memory", "Process #1 Processing Memory", "Process #2 Static Memory", and "Process #2 Processing Memory" in order from top to bottom.
        A single process with two threads uses less memory than two single threaded processes.

        However this model, like most models, is a simplified depiction of reality. Let’s bring it closer to reality by adding another layer of complexity: Copy on Write (CoW).

        Enter Copy on Write

        CoW is a common resource management technique involving sharing resources rather than duplicating them until one of the users needs to alter it, at which point the copy actually happens. If the alteration never happens, then neither does the copy.

        In old UNIX systems of the ’70s and ’80s, forking a process involved copying its entire addressable memory over to the new process’s address space, effectively doubling the memory usage. But since the mid ’90s, that’s no longer true: most, if not all, fork implementations are now sophisticated enough to trick the processes into thinking they have their own private memory regions, while in reality they’re sharing them with other processes.

        When the child process is forked, its page tables are initialized to point to the parent’s memory pages. Later on, if either the parent or the child tries to write in one of these pages, the operating system is notified and will actually copy the page before it’s modified.

        This means that if neither the child nor the parent write in these shared pages after the fork happens, forked processes are essentially free.

        A flow chart with "Parent Process Static Memory" in a text box at the top. On the second row, there are two text boxes containing the text "Process 1 Processing Memory" and "Process 2 Processing Memory", connected to the top text box with a line to illustrate resource sharing by forking of the parent process.
        Copy on Write allows for sharing resources by forking the parent process.

        So in a perfect world, our memory usage formula would now be:

        static_memory + (processes * threads * processing_memory)

        Meaning that threads would have no advantage at all over processes.

        But of course, we're not in a perfect world. Some shared pages will likely be written into at some point; the question is how many. To answer this, we’ll need to know how to accurately measure the memory usage of an application.

        Beware of Deceiving Memory Metrics

        Because of CoW and other memory sharing techniques, there are now many different ways to measure the memory usage of an application or process. Depending on the context, some metrics can be more or less relevant.

        Why RSS Isn’t the Metric You’re Looking For

        The memory metric that’s most often shown by various administration tools, such as ps, is Resident Set Size (RSS). While RSS has its uses, it's really misleading when dealing with forking servers. If you fork a 100MiB process and never write in any memory region, RSS will report both processes as using 100MiB. This is inaccurate because 100MiB is being shared between the two processes—the same memory is being reported twice.

        A slightly better metric is Proportional Set Size (PSS). In PSS, shared memory region sizes are divided by the number of processes sharing them. So our 100MiB process that was forked once should actually have a PSS of 50MiB. If you’re trying to figure out whether you’re nearing memory exhaustion, this is already a much more useful metric to look at because if you add up all the PSS numbers you get how much memory is actually being used—but we can go even deeper.

        On Linux, you can get a detailed breakdown of a process’s memory usage through cat /proc/$PID/smaps_rollup. Here’s what it looks like for a Unicorn worker on one of our apps in production:

        And for the parent process:

        Let’s unpack what each element here means. First, the Shared and Private fields. As their names suggest, Shared memory is the sum of memory regions that are in use by multiple processes, whereas Private memory is allocated for a specific process and isn’t shared by other processes. In this example, we see that out of the 771,912 kB of addressable memory, only 437,928 kB (56.7%) are really owned by the Unicorn worker; the rest is inherited from the parent process.

        As for Clean and Dirty, Clean memory is memory that has been allocated but never written to (things like the Ruby binary and various native shared libraries). Dirty memory is memory that has been written into by at least one process. It can be shared as long as it was only written into by the parent process before it forked its children.

        Measuring and Improving Copy on Write Efficiency

        We’ve established that shared memory is a key to maximizing efficiency of processes, so the important question here is how much of the static memory is actually shared. To approximate this, we compare the worker shared memory with the parent process RSS, which is 508,544 kB in this app, so:

        worker_shared_mem / master_rss
        >> (18288 + 315648) / 508544.0 * 100
        => 65.67

        Here we see that about two-thirds of the static memory is shared:

        A flow chart depicting worker shared memory, with Private and Parent Process Shared Static Memory in text boxes at the top, connecting to two separate columns, each containing Private Static Memory and Processing Memory.
        By comparing the worker shared memory with the parent process RSS, we can see that two thirds of this app’s static memory is shared.

        If we were looking at RSS, we’d think each extra worker would cost ~750MiB, but in reality it’s closer to ~427MiB, when an extra thread would cost ~257MiB. That’s still noticeably more, but far less than what the initial naive model would have predicted.

        There are a number of ways an application owner can improve CoW efficiency, the general idea being to load as much as possible as part of the boot process, before the server forks. This topic is very broad and could be a whole post by itself, but here are a few quick pointers.

        The first thing to do is configure the server to fully load the application. Unicorn, Puma, and Sidekiq Enterprise all have a preload_app option for that purpose. Once that’s done, a common pattern that degrades CoW performance is memoized class variables, for example:
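To illustrate the pattern, here’s a sketch of a memoized class variable next to a CoW-friendly constant (the class and data are illustrative, standing in for an expensive computation):

```ruby
# Anti-pattern: the hash is only allocated on the first call, after the
# server has forked, so every worker ends up with its own private copy.
class NumberFormats
  def self.separators
    @separators ||= { "en" => ",", "de" => "." }
  end
end

# CoW-friendly: evaluated eagerly when the file is loaded, before the fork,
# so the data sits in memory shared with the parent process.
class EagerNumberFormats
  SEPARATORS = { "en" => ",", "de" => "." }.freeze
end
```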

        Such delayed evaluation both prevents that memory from being shared and causes a slowdown for the first request to call this method. The simple solution is to instead use a constant, but when it’s not possible, the next best thing is to leverage the Rails eager_load_namespaces feature, as shown here:

        Now, locating these lazy loaded constants is the tricky part. Ruby heap-profiler is a useful tool for this. You can use it to dump the entire heap right after fork, and then after processing a few requests, see how much the process has grown and where these extra objects were allocated.

        The Case for Process-Based Servers

        So, while there are increased memory costs involved in using process-based servers, using more accurate memory metrics and optimizations like CoW to share memory between processes can alleviate some of this. But why use process-based servers such as Unicorn or Resque at all, given the increased memory cost? There are actually advantages to process-based servers that shouldn’t be overlooked, so let’s go through those. 

        Clean Timeout Mechanism

        When running large applications, you may run into bugs that cause some requests to take much longer than desirable. There could be many reasons for that—they might be specifically crafted by a malicious actor to try to DOS your service, or they might be processing an unexpectedly large amount of data. When this occurs, being able to cleanly interrupt this request is paramount for resiliency. Process-based servers can kill the worker process and fork a fresh one to replace it, ensuring the request is cleanly interrupted.

        Threads, however, can’t be interrupted cleanly. Since they directly share mutable resources with other threads, if you attempt to kill a single thread, you may leave some resources such as mutexes or database connections in an unrecoverable state, causing the other threads to run into various unrecoverable errors. 

        The Black Box of Global VM Lock Latency

        Improved latency is another major advantage of processes over threads in Ruby (and other languages with similar constraints, such as Python). A typical web application process does two types of work: CPU and I/O. So two Ruby processes might look like this:

        Two rows of text boxes, containing separate boxes with the text "IO", "CPU", and "GC", representing the work of processes in a Ruby web application.
        CPU and IOs in two processes in a Ruby application.

        But inside a Ruby process, because of the infamous Global VM Lock (GVL), only one thread at a time can execute Ruby code, and when the garbage collector (GC) triggers, all threads are paused. So if we were to use two threads, the picture may instead look like this:

        Two rows of text boxes, with the individual boxes containing the text "CPU", "IO", "GVL wait", and "GC", representing the work of threads in a Ruby web application and the latency introduced by the GVL.
        Global VM Lock (GVL) increases latency in Ruby threads.

        So every time two threads need to execute Ruby code at the same time, the service latency increases. How much this happens varies considerably from one application to another, and even from one request to another. If you think about it, to fully saturate a process with N threads, an application only needs to spend more than 1 / N of its time executing Ruby code, or equivalently, less than (N - 1) / N of its time waiting on I/O. So 50 percent I/O for two threads, 75 percent I/O for four threads, etc. And that’s only the saturation limit; given that a request’s use of I/O and CPU is very much unpredictable, an application doing 75 percent I/O with two threads will still frequently wait on the GVL.
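A quick check of those saturation numbers (the helper name is ours):

```ruby
# Minimum fraction of time an application must spend waiting on I/O
# to keep N threads from contending on the GVL.
def saturation_io_fraction(threads)
  1.0 - (1.0 / threads)
end

saturation_io_fraction(2) # => 0.5  (50 percent I/O)
saturation_io_fraction(4) # => 0.75 (75 percent I/O)
```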

        The common wisdom in the Ruby community is that Ruby applications are relatively I/O heavy, but from my experience it’s not quite true, especially once you consider that GC pauses do acquire the GVL too, and Ruby applications tend to spend quite a lot of time in GC.

        Web applications are often specifically crafted to avoid long I/O operations in the web request cycle. Any potentially slow or unreliable I/O operation like calling a third-party API or sending an email notification is generally deferred to a background job queue, so the remaining I/O in web requests are mostly reasonably fast database and cache queries. A corollary is that the job processing side of applications tends to be much more I/O intensive than the web side. So job processors like Sidekiq can more frequently benefit from a higher thread count. But even for web servers, using threads can be seen as a perfectly acceptable tradeoff between throughput per dollar and latency. 

        The main problem is that as of today there isn’t really a good way to measure how much the service latency is impacted by the GVL, so service owners are left in the dark. Since Ruby doesn’t provide any way to instrument the GVL, all we’re left with are proxy metrics, like gradually increasing or decreasing the number of threads and measuring the impact on the latency metrics, but that’s far from enough.

        That’s why I recently put together a feature request and a proof-of-concept implementation for Ruby 3.2 to provide a GVL instrumentation API. It’s a really low-level and hard-to-use API, but if it’s accepted, I plan to publish a gem exposing simple metrics that show exactly how much time is spent waiting for the GVL, and I hope application performance monitoring services will include it.

        Ractors and Fibers—Not a Silver Bullet Solution

        In the last few years, the Ruby community has been experimenting heavily with two other concurrency constructs that could potentially replace threads: Ractors and Fibers.

        Ractors can execute Ruby code in parallel: rather than sharing one single GVL, each Ractor has its own lock, so they could theoretically be game changing. However, Ractors can’t share any global mutable state, so even sharing a database connection pool or a logger between Ractors isn’t possible. That’s a major architectural challenge that would require most libraries to be heavily refactored, and the result would likely not be as usable. I hope to be proven wrong, but I don’t expect Ractors to be used as units of execution for sizable web applications any time soon.

        As for Fibers, they’re essentially lighter threads that are cooperatively scheduled, so everything said in the previous sections about threads and the GVL applies to them as well. They’re very well suited for I/O-intensive applications that mostly move byte streams around and don’t spend much time executing code, but any application that doesn’t benefit from more than a handful of threads won’t benefit from using fibers.

        YJIT May Change the Status Quo

        While this isn’t yet the case, the advent of YJIT may significantly increase the need to run threaded servers in the future. Since just-in-time (JIT) compilers speed up code execution at the expense of unshareable memory usage, JITing Ruby will decrease CoW performance, but it will also make applications proportionally more I/O intensive.

        Right now, YJIT only offers modest speed gains, but if in the future it manages to provide even a two times speedup, it would certainly allow application owners to ramp up their number of web threads by as much to compensate for the increased memory cost.

        Tips to Remember

        Ultimately, choosing between process- and thread-based servers involves many trade-offs, so it’s unreasonable to recommend either without first looking at an application’s metrics.

        But in the abstract, here are a few quick takeaways to keep in mind: 

        • Always enable application preloading to benefit from CoW as much as possible. 
        • Unless your application fits on the smallest offering of your hosting provider, use a smaller number of larger containers instead of a bigger number of smaller containers. For instance, a single box with 4 CPUs and 2GiB of RAM is more efficient than 4 boxes with 1 CPU and 512MiB of RAM each.
        • If latency is more important to you than keeping costs low, or if you have enough free memory for it, use Unicorn to benefit from the reliable request timeout. 
          • Note: Unicorn must be protected from slow client attacks by a reverse proxy that buffers requests. If that’s a problem, Puma can be configured to run with a single thread per worker.
        • If using threads, start with only two threads unless you’re confident your application is indeed spending more than half its time waiting on I/O operations. This doesn’t apply to job processors since they tend to be much more I/O intensive and are much less latency sensitive, so they can easily benefit from higher thread counts. 
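As a rough sketch, those defaults translate into a Puma configuration along these lines (the worker count is illustrative, not a recommendation for your app):

```ruby
# config/puma.rb
workers 4          # fewer, larger containers: roughly one process per core
threads 2, 2       # start with two threads unless the app is clearly I/O-bound
preload_app!       # boot the app before forking to maximize CoW sharing
```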

        Looking Ahead: Future Improvements to the Ruby Ecosystem

        We’re exploring a number of avenues to improve the situation for both process and thread-based servers.

        First, there’s the GVL instrumentation API mentioned previously that should hopefully allow application owners to make more informed trade-offs between throughput and latency. We could even try to use it to automatically apply backpressure by dynamically adjusting concurrency when GVL contention is over some threshold.

        Additionally, threaded web servers could theoretically implement a reliable request timeout mechanism. When a request takes longer than expected, they could stop forwarding requests to the impacted worker and wait for all other requests to either complete or timeout before killing the worker and reforking it. That’s something Matthew Draper explored a few years ago and that seems doable.

        Then, the CoW performance of Ruby itself could likely be improved further. Several patches have been merged for this purpose over the years, but we can probably do more. Notably, we suspect that Ruby’s inline caches cause most of the VM bytecode to be unshared once it’s executed. I think we could also take some inspiration from what the Instagram engineering team did to improve Python’s CoW performance. For instance, they introduced a gc.freeze() method that tells the GC that all existing memory regions will become shared. Python uses this information to make more intelligent decisions around memory usage, such as not using any free slots in these shared regions, since it’s more efficient to allocate a new page than to dirty an old one.

        Jean Boussier is a Rails Core team member, Ruby committer, and Senior Staff Engineer on Shopify's Ruby and Rails infrastructure team. You can find him on GitHub as @byroot or on Twitter at @_byroot.



        Implementing Equality in Ruby


        Ruby is one of the few programming languages that get equality right. I often play around with other languages, but keep coming back to Ruby. This is largely because Ruby’s implementation of equality is so nice.

        Nonetheless, equality in Ruby isn't straightforward. There is #==, #eql?, #equal?, #===, and more. Even if you’re familiar with how to use them, implementing them can be a whole other story.

        Let's walk through all forms of equality in Ruby and how to implement them.

        Why Properly Implementing Equality Matters

        We check whether objects are equal all the time. Sometimes we do this explicitly, sometimes implicitly. Here are some examples:

        • Do these two Employees work in the same Team? Or, in code: a comparison of their teams using ==.
        • Is the given DiscountCode valid for this particular Product? Or, in code: product.discount_codes.include?(given_discount_code).
        • Who are the (distinct) managers for this given group of employees? Or, in code:

        A good implementation of equality is predictable; it aligns with our understanding of equality.

        An incorrect implementation of equality, on the other hand, conflicts with what we commonly assume to be true. Here is an example of what happens with such an incorrect implementation:

        The geb and geb_also objects should definitely be equal. The fact that the code says they’re not is bound to cause bugs down the line. Luckily, we can implement equality ourselves and avoid this class of bugs.

        No one-size-fits-all solution exists for an equality implementation. However, there are two kinds of objects where we do have a general pattern for implementing equality: entities and value objects. These two terms come from domain-driven design (DDD), but they’re relevant even if you’re not using DDD. Let’s take a closer look.

        Entities
        Entities are objects that have an explicit identity attribute. Often, entities are stored in some database and have a unique id attribute corresponding to a unique id table column. The following Employee example class is such an entity:
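The original class listing didn’t survive here, but a minimal sketch of such an entity might look like this (the attribute names are illustrative):

```ruby
class Employee
  attr_reader :id, :name

  def initialize(id:, name:)
    @id = id
    @name = name
  end
end
```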

        Two entities are equal when their IDs are equal. All other attributes are ignored. After all, an employee’s name might change, but that does not change their identity. Imagine getting married, changing your name, and not getting paid anymore because HR has no clue who you are anymore!

        ActiveRecord, the ORM that is part of Ruby on Rails, calls entities "models" instead, but they’re the same concept. These model objects automatically have an ID. In fact, ActiveRecord models already implement equality correctly out of the box!

        Value Objects

        Value objects are objects without an explicit identity. Instead, their value as a whole constitutes identity. Consider this Point class:
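The original listing is missing here; a minimal sketch of the class might be:

```ruby
class Point
  attr_reader :x, :y

  def initialize(x, y)
    @x = x
    @y = y
  end
end
```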

        Two Points will be equal if their x and y values are equal. The x and y values constitute the identity of the point.

        In Ruby, the basic value object types are numbers (both integers and floating-point numbers), characters, booleans, and nil. For these basic types, equality works out of the box:

        Arrays of value objects are in themselves also value objects. Equality for arrays of value objects works out of the box—for example, [17, true] == [17, true]. This might seem obvious, but this isn’t true in all programming languages.

        Other examples of value objects are timestamps, date ranges, time intervals, colors, 3D coordinates, and money objects. These are built from other value objects; for example, a money object consists of a fixed-decimal number and a currency code string.

        Basic Equality (Double Equals)

        Ruby has the == and != operators for checking whether two objects are equal or not:

        Ruby’s built-in types all have a sensible implementation of ==. Some frameworks and libraries provide custom types, which will have a sensible implementation of ==, too. Here is an example with ActiveRecord:

        For custom classes, the == operator returns true if and only if the two objects are the same instance. Ruby does this by checking whether the internal object IDs are equal. These internal object IDs are accessible using #__id__. Effectively, gizmo == thing is the same as gizmo.__id__ == thing.__id__.
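For example, with a bare custom class:

```ruby
class Gizmo; end

gizmo = Gizmo.new
thing = Gizmo.new

gizmo == gizmo               # => true: same instance
gizmo == thing               # => false: different instances
gizmo.__id__ == thing.__id__ # => false, mirroring the == result
```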

        This behavior is often not a good default, however. To illustrate this, consider the Point class from earlier:

        The == operator will return true only when calling it on itself:

        This default behavior is often undesirable in custom classes. After all, two points are equal if (and only if) their x and y values are equal. This behavior is undesirable for value objects (such as Point) and entities (such as the Employee class mentioned earlier).

        The desired behavior for value objects and entities is as follows:

        Image showing the desired behavior for value objects and entities. The first pairing for value objects checks if x and y (all attributes) are equal. The second pair for entities, checks whether the id attributes are equal. The third pair shows the default ruby check, which is whether internal object ids are equal
        • For value objects (a), we’d like to check whether all attributes are equal.
        • For entities (b), we’d like to check whether the explicit ID attributes are equal.
        • By default (c), Ruby checks whether the internal object IDs are equal.

        Instances of Point are value objects. With the above in mind, a good implementation of == for Point would look as follows:
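The listing is missing here; an implementation consistent with the surrounding description would be:

```ruby
class Point
  attr_reader :x, :y

  def initialize(x, y)
    @x = x
    @y = y
  end

  # Value object equality: same class and same attribute values.
  def ==(other)
    self.class == other.class && other.x == x && other.y == y
  end
end
```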

        This implementation checks all attributes and the class of both objects. Thanks to the class check, comparing a Point instance with an object of a different class returns false rather than raising an exception.

        Checking equality on Point objects now works as intended:

        The != operator works too:

        A correct implementation of equality has three properties: reflexivity, symmetry, and transitivity.

        Image with simple circles to describe the implementation of equality having three properties: reflexivity, symmetry, and transitivity, described below the image for more context
        • Reflexivity (a): An object is equal to itself: a == a
        • Symmetry (b): If a == b, then b == a
        • Transitivity (c): If a == b and b == c, then a == c

        These properties embody a common understanding of what equality means. Ruby won’t check these properties for you, so you’ll have to be vigilant to ensure you don’t break these properties when implementing equality yourself.

        IEEE 754 and violations of reflexivity

        It seems natural that something would be equal to itself, but there is an exception. IEEE 754 defines NaN (Not a Number) as a value resulting from an undefined floating-point operation, such as dividing 0 by 0. NaN, by definition, is not equal to itself. You can see this for yourself:
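A quick check in plain Ruby:

```ruby
nan = 0.0 / 0.0

nan.nan?    # => true
nan == nan  # => false: NaN is not equal to itself
```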

        This means that == in Ruby is not universally reflexive. Luckily, exceptions to reflexivity are exceedingly rare; this is the only exception I am aware of.

        Basic Equality for Value Objects

        The Point class is an example of a value object. The identity of a value object, and thereby equality, is based on all its attributes. That is exactly what the earlier example does:

        Basic Equality for Entities

        Entities are objects with an explicit identity attribute, commonly @id. Unlike value objects, an entity is equal to another entity if and only if their explicit identities are equal.

        Entities are uniquely identifiable objects. Typically, any database record with an id column corresponds to an entity. Consider the following Employee entity class:

        Other forms of ID are possible too. For example, books have an ISBN, and recordings have an ISRC. But if you have a library with multiple copies of the same book, then ISBN won’t uniquely identify your books anymore.

        For entities, the == operator is more involved to implement than for value objects:

        This code does the following:

        • The super call invokes the default implementation of equality: Object#==. On Object, the #== method returns true if and only if the two objects are the same instance. This super call, therefore, ensures that the reflexivity property always holds.
        • As with Point, the implementation Employee#== checks class. This way, an Employee instance can be checked for equality against objects of other classes, and this will always return false.
        • If @id is nil, the entity is considered not equal to any other entity. This is useful for newly-created entities which have not been persisted yet.
        • Lastly, this implementation checks whether the ID is the same as the ID of the other entity. If so, the two entities are equal.

        Checking equality on entities now works as intended:

        Blog post of Theseus

        Implementing equality on entity objects isn’t always straightforward. An object might have an id attribute that doesn’t quite align with the object’s conceptual identity.

        Take a BlogPost class, for example, with id, title, and body attributes. Imagine creating a BlogPost, then halfway through writing the body for it, scratching everything and starting over with a new title and a new body. The id of that BlogPost will still be the same, but is it still the same blog post?

        If I follow a Twitter account that later gets hacked and turned into a cryptocurrency spambot, is it still the same Twitter account?

        These questions don’t have a proper answer. That’s not surprising, as this is essentially the Ship of Theseus thought experiment. Luckily, in the world of computers, the generally accepted answer seems to be yes: if two entities have the same id, then the entities are equal as well.

        Basic Equality with Type Coercion

        Typically, an object is not equal to an object of a different class. However, this isn’t always the case. Consider integers and floating-point numbers:
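For example:

```ruby
float_two = 2.0
integer_two = 2

float_two == integer_two  # => true
float_two.class           # => Float
integer_two.class         # => Integer
```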

        Here, float_two is an instance of Float, and integer_two is an instance of Integer. They are equal: float_two == integer_two is true, despite different classes. Instances of Integer and Float are interchangeable when it comes to equality.

        As a second example, consider this Path class:

        This Path class provides an API for creating paths:

        The Path class is a value object, and implementing #== could be done just as with other value objects:

        However, the Path class is special because it represents a value that could be considered a string. The == operator will return false when checking equality with anything that isn’t a Path:

        It can be beneficial for path == "/usr/bin/ruby" to be true rather than false. To make this happen, the == operator needs to be implemented differently:

        This implementation of == coerces both objects to Strings, and then checks whether they are equal. Checking equality of a Path now works:

        This class implements #to_str, rather than #to_s. These methods both return strings, but by convention, the to_str method is only implemented on types that are interchangeable with strings.

        The Path class is such a type. By implementing Path#to_str, the implementation states that this class behaves like a String. For example, it’s now possible to pass a Path (rather than a String) to methods that perform implicit string conversion, since those methods accept anything that responds to #to_str.

        String#== also uses the to_str method. Because of this, the == operator is reflexive:

        Strict Equality

        Ruby provides #equal? to check whether two objects are the same instance:
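For example:

```ruby
a = "hello"
b = a.dup  # a distinct String instance with the same content

a.equal?(a)  # => true: same instance
a.equal?(b)  # => false: distinct instances
a == b       # => true: same content
```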

        Here, we end up with two String instances with the same content. Because they are distinct instances, #equal? returns false, and because their content is the same, #== returns true.

        Do not implement #equal? in your own classes. It isn’t meant to be overridden. It’ll all end in tears.

        Earlier in this post, I mentioned that #== has the property of reflexivity: an object is always equal to itself. Here is a related property for #equal?:

        Property: Given objects a and b. If a.equal?(b), then a == b.

        Ruby won't automatically validate this property for your code. It’s up to you to ensure that this property holds when you implement the equality methods.
        For example, recall the implementation of Employee#== from earlier in this article:

        The call to super on the first line makes this implementation of #== reflexive. This super invokes the default implementation of #==, which delegates to #equal?. Therefore, I could have used #equal? rather than super:

        I prefer using super, though this is likely a matter of taste.

        Hash Equality

        In Ruby, any object can be used as a key in a Hash. Strings, symbols, and numbers are commonly used as Hash keys, but instances of your own classes can function as Hash keys too—provided that you implement both #eql? and #hash.

        The #eql? Method

        The #eql? method behaves similarly to #==:

        However, #eql?, unlike #==, does not perform type coercion:
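The difference is easy to see with numbers:

```ruby
2.0 == 2       # => true: #== coerces between Float and Integer
2.0.eql?(2)    # => false: #eql? does not coerce
2.0.eql?(2.0)  # => true
```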

        If #== doesn’t perform type coercion, the implementations of #eql? and #== will be identical. Rather than copy-pasting, however, we’ll put the implementation in #eql?, and let #== delegate to #eql?:

        I made the deliberate decision to put the implementation in #eql? and let #== delegate to it, rather than the other way around. If we were to let #eql? delegate to #==, there’s an increased risk that someone will update #== and inadvertently break the properties of #eql? (mentioned below) in the process.

        For the Path value object, whose #== method does perform type coercion, the implementation of #eql? will differ from the implementation of #==:

        Here, #== does not delegate to #eql?, nor the other way around.

        A correct implementation of #eql? has the following two properties:

        • Property: Given objects a and b. If a.eql?(b), then a == b.
        • Property: Given objects a and b. If a.equal?(b), then a.eql?(b).

        These two properties are not explicitly called out in the Ruby documentation. However, to the best of my knowledge, all implementations of #eql? and #== respect these properties.

        Ruby will not automatically validate that these properties hold in your code. It’s up to you to ensure that these properties aren’t violated.

        The #hash Method

        For an object to be usable as a key in a Hash, it needs to implement not only #eql?, but also #hash. This #hash method must return an integer, the hash code, that respects the following property:

        Property: Given objects a and b. If a.eql?(b), then a.hash == b.hash.

        Typically, the implementation of #hash creates an array of all attributes that constitute identity and returns the hash of that array. For example, here is Point#hash:

        For Path, the implementation of #hash will look similar:

        For the Employee class, which is an entity rather than a value object, the implementation of #hash will use the class and the @id:

        If two objects are not equal, the hash code should ideally be different, too. This isn’t mandatory, however. It’s okay for two non-equal objects to have the same hash code. Ruby will use #eql? to tell objects with identical hash codes apart.

        Avoid XOR for Calculating Hash Codes

        A popular but problematic approach for implementing #hash uses XOR (the ^ operator). Such an implementation would calculate the hash codes of each individual attribute, and combine these hash codes with XOR. For example:

        With such an implementation, the chance of a hash code collision, which means that multiple objects have the same hash code, is higher than with an implementation that delegates to Array#hash. Hash code collisions will degrade performance and could potentially pose a denial-of-service security risk.

        A better way, though still flawed, is to multiply the components of the hash code by unique prime numbers before combining them:

        Such an implementation has additional performance overhead due to the extra multiplications. It also requires mental effort to ensure the implementation is and remains correct.

        An even better way of implementing #hash is the one I’ve laid out before—making use of Array#hash:

        An implementation that uses Array#hash is simple, performs quite well, and produces hash codes with the lowest chance of collisions. It’s the best approach to implementing #hash.

        Putting it Together

        With both #eql? and #hash in place, the Point, Path, and Employee objects can be used as hash keys:

        Here, we use a Hash instance to keep track of a collection of Points. We can also use a Set for this, which uses a Hash under the hood, but provides a nicer API:

        Objects used in Sets need to have an implementation of both #eql? and #hash, just like objects used as hash keys.

        Objects that perform type coercion, such as Path, can also be used as hash keys, and thus also in sets:

        We now have an implementation of equality that works for all kinds of objects.

        Mutability, Nemesis of Equality

        So far, the examples for value objects have assumed that these value objects are immutable. This is with good reason because mutable value objects are far harder to deal with.

        To illustrate this, consider a Point instance used as a hash key:

        The problem arises when changing attributes of this point:

        Because the hash code is based on the attributes, and an attribute has changed, the hash code is no longer the same. As a result, collection no longer seems to contain the point. Uh oh!

        There are no good ways to solve this problem except for making value objects immutable.

        This isn’t a problem with entities. This is because the #eql? and #hash methods of an entity are solely based on its explicit identity—not its attributes.

        So far, we’ve covered #==, #eql?, and #hash. These three methods are sufficient for a correct implementation of equality. However, we can go further to improve that sweet Ruby developer experience and implement #===.

        Case Equality (Triple Equals)

        The #=== operator, also called the case equality operator, isn’t really an equality operator at all. Rather, it’s better to think of it as a membership testing operator. Consider the following:
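For example, with a Range:

```ruby
(1..10) === 5       # => true
(1..10) === 15      # => false
("a".."m") === "f"  # => true
```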

        Here, Range#=== checks whether a range covers a certain element. It’s also common to use case expressions to achieve the same:

        This is also where case equality gets its name. Triple-equals is called case equality, because case expressions use it.

        You never need to use case. It’s possible to rewrite a case expression using if and ===. In general, case expressions tend to look cleaner. Compare:
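The two spellings side by side (labels and ranges chosen for illustration):

```ruby
value = 7

label =
  case value
  when 1..10 then "small"
  when 11..100 then "medium"
  else "large"
  end

# The same logic, spelled out with if and ===:
label_via_if =
  if (1..10) === value
    "small"
  elsif (11..100) === value
    "medium"
  else
    "large"
  end

label         # => "small"
label_via_if  # => "small"
```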

        The examples above all use Range#===, to check whether the range covers a certain number. Another commonly used implementation is Class#===, which checks whether an object is an instance of a class:
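For example:

```ruby
String === "hello"   # => true
Integer === "hello"  # => false
Numeric === 42       # => true: subclasses such as Integer count too
```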

        I’m rather fond of the #grep method, which uses #=== to select matching elements from an array. It can be shorter and sweeter than using #select:
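For example:

```ruby
mixed = [1, "two", 3.0, "four", 5]

mixed.grep(String)   # => ["two", "four"]
mixed.grep(Numeric)  # => [1, 3.0, 5]

# The equivalent #select is wordier:
mixed.select { |element| String === element }  # => ["two", "four"]
```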

        Regular expressions also implement #===. You can use it to check whether a string matches a regular expression:
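For example:

```ruby
/[a-z]/ === "+491573abcde"    # => true: the string contains lowercase letters
/\A\d+\z/ === "+491573abcde"  # => false: the string is not all digits
```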

        It helps to think of a regular expression as the (infinite) collection of all strings that can be produced by it. The set of all strings produced by /[a-z]/ includes the example string "+491573abcde". Similarly, you can think of a Class as the (infinite) collection of all its instances, and a Range as the collection of all elements in that range. This way of thinking clarifies that #=== really is a membership testing operator.

        An example of a class that could implement #=== is a PathPattern class:

        An example instance is"/bin/*"), which matches anything directly under the /bin directory, such as /bin/ruby, but not /var/log.

        The implementation of PathPattern#=== uses Ruby’s built-in File.fnmatch to check whether the pattern string matches. Here is an example of it in use:

        Worth noting is that File.fnmatch calls #to_str on its arguments. This way, #=== automatically works on other string-like objects as well, such as Path instances:

        The PathPattern class implements #===, and therefore PathPattern instances work with case/when, too:

        Ordered Comparison

        For some objects, it’s useful not only to check whether two objects are the same, but how they are ordered. Are they larger? Smaller? Consider this Score class, which models the scoring system of my university in Ghent, Belgium.

        (I was a terrible student. I’m not sure if this was really how the scoring even worked — but as an example, it will do just fine.)

        In any case, we benefit from having such a Score class. We can encode relevant logic there, such as determining the grade and checking whether or not a score is passing. For example, it might be useful to get the lowest and highest score out of a list:

        However, as it stands right now, the expressions scores.min and scores.max will result in an error: comparison of Score with Score failed (ArgumentError). We haven’t told Ruby how to compare two Score objects. We can do so by implementing Score#<=>:

        An implementation of #<=> returns one of four possible values:

        • It returns 0 when the two objects are equal.
        • It returns -1 when self is less than other.
        • It returns 1 when self is greater than other.
        • It returns nil when the two objects cannot be compared.

        The #<=> and #== operators are connected:

        • Property: Given objects a and b. If (a <=> b) == 0, then a == b.
        • Property: Given objects a and b. If (a <=> b) != 0, then a != b.

        As before, it’s up to you to ensure that these properties hold when implementing #== and #<=>. Ruby won’t check this for you.

        For simplicity, I’ve left out the implementation Score#== in the Score example above. It’d certainly be good to have that, though.

        In the case of Score#<=>, we bail out if other is not a score, and otherwise, we call #<=> on the two values. We can check that this works: the expression scores[0] <=> scores[1] evaluates to -1, which is correct because a score of 6 is lower than a score of 12. (Did you know that the Belgian high school system used to have a scoring system where 1 was the highest and 10 was the lowest? Imagine the confusion!)

        With Score#<=> in place, scores.max now returns the maximum score. Other methods such as #min, #minmax, and #sort work as well.

        However, we can’t yet use operators like <. The expression scores[0] < scores[1], for example, will raise an undefined method error: undefined method `<' for #<Score:0x00112233 @value=6>. We can solve that by including the Comparable mixin:

        By including Comparable, the Score class automatically gains the <, <=, >, and >= operators, which all call <=> internally. The expression scores[0] < scores[1] now evaluates to a boolean, as expected.

        The Comparable mixin also provides other useful methods such as #between? and #clamp.

        Wrapping Up

        We talked about the following topics:

        • the #== operator, used for basic equality, with optional type coercion
        • #equal?, which checks whether two objects are the same instance
        • #eql? and #hash, which are used for testing whether an object is a key in a hash
        • #===, which isn’t quite an equality operator, but rather a “is kind of” or “is member of” operator
        • #<=> for ordered comparison, along with the Comparable module, which provides operators such as < and >=

        You now know all you need to know about implementing equality in Ruby. To find out more, the Ruby documentation for Object#==, Object#eql?, Object#hash, and the Comparable module is a good place to start.

        Denis is a Senior Software Engineer at Shopify. He has made it a habit of thanking ATMs when they give him money, thereby singlehandedly staving off the inevitable robot uprising.

        If building systems from the ground up to solve real-world problems interests you, our Engineering blog has stories about other challenges we have encountered. Visit our Engineering career page to find out about our open positions. Join our remote team and work (almost) anywhere. Learn about how we’re hiring to design the future together—a future that is digital by default.

        Lessons Learned From Running Apache Airflow at Scale

        By Megan Parker and Sam Wheating

        Apache Airflow is an orchestration platform that enables development, scheduling and monitoring of workflows. At Shopify, we’ve been running Airflow in production for over two years for a variety of workflows, including data extractions, machine learning model training, Apache Iceberg table maintenance, and DBT-powered data modeling. At the time of writing, we are currently running Airflow 2.2 on Kubernetes, using the Celery executor and MySQL 8.

        System diagram showing Shopify's Airflow Architecture
        Shopify’s Airflow Architecture

        Shopify’s usage of Airflow has scaled dramatically over the past two years. In our largest environment, we run over 10,000 DAGs representing a large variety of workloads. This environment averages over 400 tasks running at a given moment and over 150,000 runs executed per day. As adoption increases within Shopify, the load incurred on our Airflow deployments will only increase. As a result of this rapid growth, we have encountered a few challenges, including slow file access, insufficient control over DAG (directed acyclic graph) capabilities, irregular levels of traffic, and resource contention between workloads, to name a few.

        Below we’ll share some of the lessons we learned and solutions we built in order to run Airflow at scale.

        1. File Access Can Be Slow When Using Cloud Storage

        Fast file access is critical to the performance and integrity of an Airflow environment. A well defined strategy for file access ensures that the scheduler can process DAG files quickly and keep your jobs up-to-date.

        Airflow keeps its internal representation of its workflows up-to-date by repeatedly scanning and reparsing all the files in the configured DAG directory. These files must be scanned often in order to maintain consistency between the on-disk source of truth for each workload and its in-database representation. This means the contents of the DAG directory must be consistent across all schedulers and workers in a single environment (Airflow suggests a few ways of achieving this).

        At Shopify, we use Google Cloud Storage (GCS) for the storage of DAGs. Our initial deployment of Airflow utilized GCSFuse to maintain a consistent set of files across all workers and schedulers in a single Airflow environment. However, at scale this proved to be a bottleneck on performance as every file read incurred a request to GCS. The volume of reads was especially high because every pod in the environment had to mount the bucket separately.

        After some experimentation we found that we could vastly improve performance across our Airflow environments by running an NFS (network file system) server within the Kubernetes cluster. We then mounted this NFS server as a read-write-many volume into the worker and scheduler pods. We wrote a custom script which synchronizes the state of this volume with  GCS, so that users only have to interact with GCS for uploading or managing DAGs. This script runs in a separate pod within the same cluster. This also allows us to conditionally sync only a subset of the DAGs from a given bucket, or even sync DAGs from multiple buckets into a single file system based on the environment’s configuration (more on this later).

        Altogether this provides us with fast file access as a stable, external source of truth, while maintaining our ability to quickly add or modify DAG files within Airflow. Additionally, we can use Google Cloud Platform’s IAM (identity and access management) capabilities to control which users are able to upload files to a given environment. For example, we allow users to upload DAGs directly to the staging environment but limit production environment uploads to our continuous deployment processes.

        Another factor to consider when ensuring fast file access when running Airflow at scale is your file processing performance. Airflow is highly configurable and offers several ways to tune the background file processing (such as the sort mode, the parallelism, and the timeout). This allows you to optimize your environments for interactive DAG development or scheduler performance, depending on the requirements.

        2. Increasing Volumes Of Metadata Can Degrade Airflow Operations

        In a normal-sized Airflow deployment, performance degradation due to metadata volume wouldn’t be an issue, at least within the first years of continuous operation.

        However, at scale the metadata starts to accumulate pretty fast. After a while this can start to incur additional load on the database. This is noticeable in the loading times of the Web UI and even more so during Airflow upgrades, during which migrations can take hours.

        After some trial and error, we settled on a metadata retention policy of 28 days, and implemented a simple DAG which uses ORM (object–relational mapping) queries within a PythonOperator to delete rows from any tables containing historical data (DagRuns, TaskInstances, Logs, TaskRetries, etc). We settled on 28 days as this gives us sufficient history for managing incidents and tracking historical job performance, while keeping the volume of data in the database at a reasonable level.

        Unfortunately, this means that features of Airflow which rely on durable job history (for example, long-running backfills) aren’t supported in our environment. This wasn’t a problem for us, but it may cause issues depending on your retention period and usage of Airflow.

        As an alternative approach to a custom DAG, Airflow has recently added support for a db clean command which can be used to remove old metadata. This command is available in Airflow version 2.3.

        3. DAGs Can Be Hard To Associate With Users And Teams

        When running Airflow in a multi-tenant setting (and especially at a large organization), it’s important to be able to trace a DAG back to an individual or team. Why? Because if a job is failing, throwing errors, or interfering with other workloads, we can quickly reach out to the appropriate users.

        If all of the DAGs were deployed directly from one repository, we could simply use git blame to track down the job owner. However, since we allow users to deploy workloads from their own projects (and even dynamically generate jobs at deploy-time), this becomes more difficult.

        In order to easily trace the origin of DAGs, we introduced a registry of Airflow namespaces, which we refer to as an Airflow environment’s manifest file.

        The manifest file is a YAML file where users must register a namespace for their DAGs. In this file, they include information about the jobs’ owners and source GitHub repository (or even source GCS bucket), as well as define some basic restrictions for their DAGs. We maintain a separate manifest per environment and upload it to GCS alongside the DAGs.

        4. DAG Authors Have A Lot Of Power

        By allowing users to directly write and upload DAGs to a shared environment, we’ve granted them a lot of power. Since Airflow is a central component of our data platform, it ties into a lot of different systems and thus jobs have wide-ranging access. While we trust our users, we still want to maintain some level of control over what they can and cannot do within a given Airflow Environment. This is especially important at scale as it becomes unfeasible for the Airflow administrators to review all jobs before they make it to production.

        In order to create some basic guardrails, we’ve implemented a DAG policy which reads configuration from the previously mentioned Airflow manifest, and rejects DAGs which don’t conform to their namespace’s constraints by raising an AirflowClusterPolicyViolation.

        Based on the contents of the manifest file, this policy will apply a few basic restrictions to DAG files, such as:

        • A DAG ID must be prefixed with the name of an existing namespace, for ownership.
        • Tasks in a DAG must only enqueue tasks to the specified celery queue—more on this later.
        • Tasks in a DAG can only be run in specified pools, to prevent one workload from taking over another’s capacity.
        • Any KubernetesPodOperators in this DAG must only launch pods in the specified namespaces, to prevent access to other namespace’s secrets.
        • Tasks in a DAG can only launch pods into specified sets of external Kubernetes clusters.

        This policy can be extended to enforce other rules (for example, only allowing a limited set of operators), or even mutate tasks to conform to a certain specification (for example, adding a namespace-specific execution timeout to all tasks in a DAG).

        Here’s a simplified example demonstrating how to create a DAG policy which reads the previously shared manifest file, and implements the first three of the controls mentioned above:
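A rough, self-contained sketch of what such a policy could look like. The manifest structure, the namespace-prefix convention, and the attribute names on the dag and task objects here are assumptions for illustration; in a real deployment this would be registered as an Airflow cluster policy, and AirflowClusterPolicyViolation would come from Airflow itself rather than being defined locally.

```python
class AirflowClusterPolicyViolation(Exception):
    """Raised when a DAG violates its namespace's constraints."""

# Hypothetical manifest contents, keyed by namespace.
MANIFEST = {
    "data-extraction": {
        "queue": "extraction-queue",
        "pools": ["extraction-pool"],
    },
}

def dag_policy(dag, manifest=MANIFEST):
    """Reject DAGs that don't conform to their namespace's constraints."""
    # A DAG id must be prefixed with the name of an existing namespace.
    namespace = next(
        (ns for ns in manifest if dag.dag_id.startswith(ns + "-")), None
    )
    if namespace is None:
        raise AirflowClusterPolicyViolation(
            f"DAG id {dag.dag_id!r} is not prefixed with a registered namespace"
        )
    constraints = manifest[namespace]
    for task in dag.tasks:
        # Tasks must only enqueue work on the namespace's Celery queue.
        if task.queue != constraints["queue"]:
            raise AirflowClusterPolicyViolation(
                f"Task {task.task_id!r} uses queue {task.queue!r}, "
                f"expected {constraints['queue']!r}"
            )
        # Tasks can only run in the namespace's allowed pools.
        if task.pool not in constraints["pools"]:
            raise AirflowClusterPolicyViolation(
                f"Task {task.task_id!r} uses pool {task.pool!r}, "
                f"which is not allowed for namespace {namespace!r}"
            )
```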

        These validations provide us with sufficient traceability while also creating some basic controls which reduce DAGs’ ability to interfere with each other.

        5. Ensuring A Consistent Distribution Of Load Is Difficult

        It’s very tempting to use an absolute interval for your DAG’s schedule_interval—simply set the DAG to run every timedelta(hours=1) and you can walk away, safely knowing that your DAG will run approximately every hour. However, this can lead to issues at scale.

        When a user merges a large number of automatically-generated DAGs, or writes a python file which generates many DAGs at parse-time, all the DAGRuns will be created at the same time. This creates a large surge of traffic which can overload the Airflow scheduler, as well as any external services or infrastructure which the job is utilizing (for example, a Trino cluster).

        After a single schedule_interval has passed, all these jobs will run again at the same time, thus leading to another surge of traffic. Ultimately, this can lead to suboptimal resource utilization and increased execution times.

        While crontab-based schedules won’t cause these kinds of surges, they come with their own issues. Humans are biased towards human-readable schedules, and thus tend to create jobs which run at the top of every hour, every hour, every night at midnight, etc. Sometimes there’s a valid application-specific reason for this (for example, every night at midnight we want to extract the previous day’s data), but often we have found users just want to run their job on a regular interval. Allowing users to directly specify their own crontabs can lead to bursts of traffic which can impact SLOs and put uneven load on external systems.

        As a solution to both these issues, we use a deterministically randomized schedule interval for all automatically generated DAGs (which represent the vast majority of our workflows). This is typically based on a hash of a constant seed such as the dag_id.

        The below snippet provides a simple example of a function which generates deterministic, random crontabs which yield constant schedule intervals. Unfortunately, this limits the range of possible intervals since not all intervals can be expressed as a single crontab. We have not found this restricted choice of intervals to be a problem in practice, and in cases where we really need to run a job every five hours, we just accept that there will be a single four-hour interval each day.
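A minimal sketch of the idea for daily schedules; the function name and the choice of MD5 as the hash are assumptions for illustration:

```python
import hashlib

def deterministic_random_crontab(dag_id: str) -> str:
    """Derive a stable pseudo-random daily crontab from the dag_id.

    The same dag_id always yields the same schedule, but different
    DAGs get spread across the minutes and hours of the day.
    """
    seed = int(hashlib.md5(dag_id.encode("utf-8")).hexdigest(), 16)
    minute = seed % 60
    hour = (seed // 60) % 24
    return f"{minute} {hour} * * *"

deterministic_random_crontab("my-nightly-dag")  # e.g. "37 14 * * *"
```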

        Thanks to our randomized schedule implementation, we were able to smooth the load out significantly. The below image shows the number of tasks completed every 10 minutes over a twelve hour period in our single largest Airflow environment.

        Bar graph showing the Tasks Executed versus time. Shows a per 10–minute Interval in our Production Airflow Environment
        Tasks Executed per 10–minute Interval in our Production Airflow Environment

        6. There Are Many Points of Resource Contention

        There are a lot of possible points of resource contention within Airflow, and it’s really easy to end up chasing bottlenecks through a series of experimental configuration changes. Some of these resource conflicts can be handled within Airflow, while others may require some infrastructure changes. Here are a couple of ways we handle resource contention within Airflow at Shopify:


        Pools

        One way to reduce resource contention is to use Airflow pools. Pools are used to limit the concurrency of a given set of tasks. These can be really useful for reducing disruptions caused by bursts in traffic. While pools are a useful tool to enforce task isolation, they can be a challenge to manage because only administrators have access to edit them via the Web UI.

        We wrote a custom DAG which synchronizes the pools in our environment with the state specified in a Kubernetes ConfigMap via some simple ORM queries. This lets us manage pools alongside the rest of our Airflow deployment configuration and allows users to update pools via a reviewed pull request without needing elevated access.
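        The actual DAG performs the synchronization with ORM queries against the metadata database; as an illustrative sketch (names here are hypothetical), the diff between the ConfigMap state and the live pools might be computed like this:

```python
def pool_sync_plan(desired, current):
    """Diff desired pool slot counts (as read from the ConfigMap)
    against the pools currently in the metadata database, yielding
    the operations to perform. Both arguments map pool name -> slots.
    """
    to_create = {name: slots for name, slots in desired.items()
                 if name not in current}
    to_update = {name: slots for name, slots in desired.items()
                 if name in current and current[name] != slots}
    # Never delete Airflow's built-in default pool.
    to_delete = sorted(name for name in current
                       if name not in desired and name != "default_pool")
    return to_create, to_update, to_delete
```

        The DAG then applies each bucket of changes with the corresponding ORM inserts, updates, and deletes.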

        Priority Weight

        The priority_weight argument allows you to assign a higher priority to a given task. Tasks with a higher priority float to the top of the pile to be scheduled first. Although not a direct solution to resource contention, priority_weight can be useful for ensuring that latency-sensitive critical tasks run before lower-priority tasks. However, given that priority_weight is an arbitrary scale, it can be hard to determine the actual priority of a task without comparing it to all other tasks. We use this to ensure that our basic Airflow monitoring DAG (which emits simple metrics and powers some alerts) always runs as promptly as possible.

        It’s also worthwhile to note that by default, the effective priority_weight of a task used when making scheduling decisions is the sum of its own weight and that of all its downstream tasks. What this means is that upstream tasks in large DAGs are often favored over tasks in smaller DAGs. Therefore, using priority_weight requires some knowledge of the other DAGs running in the environment.
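        As an illustration of that default "downstream" weighting rule (the data structures here are hypothetical, not Airflow's internals), each task's effective priority is its own weight plus the weight of every task downstream of it, counted once:

```python
def effective_priority(task_id, weights, downstream):
    """Compute the default 'downstream' effective priority of a task:
    its own priority_weight plus that of all tasks downstream of it.

    weights: dict of task_id -> priority_weight
    downstream: dict of task_id -> list of immediate downstream task_ids
    """
    seen = set()

    def visit(tid):
        if tid in seen:          # count each task once, even in diamonds
            return 0
        seen.add(tid)
        return weights[tid] + sum(visit(d) for d in downstream.get(tid, []))

    return visit(task_id)
```

        This is why upstream tasks in large DAGs accumulate large effective weights and tend to be favored over tasks in smaller DAGs.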

        Celery Queues and Isolated Workers

        If you need your tasks to execute in separate environments (for example, dependencies on different Python libraries, higher resource allowances for intensive tasks, or differing levels of access), you can create additional queues which a subset of jobs submit tasks to. Separate sets of workers can then be configured to pull from separate queues. A task can be assigned to a separate queue using the queue argument in operators. To start a worker which runs tasks from a different queue, you can use the following command:

        airflow celery worker --queues <list of queues>

        This can help ensure that sensitive or high-priority workloads have sufficient resources, as they won’t be competing with other workloads for worker capacity.

        Any combination of pools, priority weights and queues can be useful in reducing resource contention. While pools allow for limiting concurrency within a single workload, a priority_weight can be used to make individual tasks run at a lower latency than others. If you need even more flexibility, worker isolation provides fine-grained control over the environment in which your tasks are executed.

        It’s important to remember that not all resources can be carefully allocated in Airflow—scheduler throughput, database capacity and Kubernetes IP space are all finite resources which can’t be restricted on a workload-by-workload basis without the creation of isolated environments.

        Going Forward…

        There are many considerations that go into running Airflow with such high throughput, and any combination of solutions can be useful. We’ve learned a ton and we hope you’ll remember these lessons and apply some of our solutions in your own Airflow infrastructure and tooling.

        To sum up our key takeaways:

        • A combination of GCS and NFS allows for file management that’s both performant and easy to use.
        • Metadata retention policies can reduce degradation of Airflow performance.
        • A centralized metadata repository can be used to track DAG origins and ownership.
        • DAG Policies are great for enforcing standards and limitations on jobs.
        • Standardized schedule generation can reduce or eliminate bursts in traffic.
        • Airflow provides multiple mechanisms for managing resource contention.

        What’s next for us? We’re currently working on applying the principles of scaling Airflow in a single environment as we explore splitting our workloads across multiple environments. This will make our platform more resilient, allow us to fine-tune each individual Airflow instance based on its workloads’ specific requirements, and reduce the reach of any one Airflow deployment.

        Got questions about implementing Airflow at scale? You can reach out to either of the authors on the Apache Airflow slack community.

        Megan has worked on the data platform team at Shopify for the past 9 months where she has been working on enhancing the user experience for Airflow and Trino. Megan is located in Toronto, Canada where she enjoys any outdoor activity, especially biking and hiking.

        Sam is a Senior developer from Vancouver, BC who has been working on the Data Infrastructure and Engine Foundations teams at Shopify for the last 2.5 years. He is an internal advocate for open source software and a recurring contributor to the Apache Airflow project.

        Interested in tackling challenging problems that make a difference? Visit our Data Science & Engineering career page to browse our open positions. You can also contribute to Apache Airflow to improve Airflow for everyone.


        Asynchronous Communication is the Great Leveler in Engineering


        In March 2020—the early days of the pandemic—Shopify transitioned to become a remote-first company. We call it being Digital by Design. We are now proud to employ Shopifolk around the world.

        Not only has being Digital by Design allowed our staff the flexibility to work from wherever they work best, it has also increased the amount of time that they are able to spend with their families, friends, or social circles. In the pre-remote world, many of us had to move far away from our hometowns in order to get ahead in our careers. However, recently, my own family has been navigating the reality of aging parents, and remote working has allowed us to move back closer to home so we are always there for them whilst still doing the jobs we love.

        However, being remote isn’t without its challenges. Much of the technology industry has spent decades working in colocated office space. This has formed habits in all of us that aren’t compatible with effective remote work. We’re now on a journey of mindfully unraveling these default behaviors and replacing them with remote-focused ways of working.

        If you’ve worked in an office before, you’ll be familiar with synchronous communication and how it forms strong bonds between colleagues: a brief chat in the kitchen whilst getting a coffee, a discussion at your desk that was prompted by a recent code change, or a conversation over lunch.

        With the layout of physical office spaces encouraging spontaneous interactions, context could be gathered and shared through osmosis with little specific effort—it just happened.

        There are many challenges that you face in engineering when working on a globally distributed team. Not everyone is online at the same time of the day, meaning that it can be harder to get immediate answers to questions. You might not even know who to ask when you can’t stick your head up and look around to see who is at their desks. You may worry about ensuring that the architectural direction that you’re about to take in the codebase is the right one when building for the long term—how can you decide when you’re sitting at home on your own?

        Teams have had to shift to using a different toolbox of skills now that everyone is remote. One such skill is the shift to more asynchronous communication: an essential glue that holds a distributed workforce together. It’s inclusive of teams in different time zones, it leaves an audit trail of communication and decision making, it encourages us to communicate concisely, and it gives everybody the same window into the company, regardless of where they are in the world.

        However, an unstructured approach can be challenging, especially when teams are working on establishing their communication norms. It helps to have a model with which to reason about how best to communicate for a given purpose and to understand what the side effects of that communication might be.

        The Spectrum of Synchronousness

        When working remotely, we have to adapt to a different landscape. The increased flexibility of working hours, the ability to find flow and do deep work, and the fact that our colleagues are in different places means we can’t rely on the synchronous, impromptu interactions of the office anymore. We have to navigate a continuum of communication choices between synchronous and asynchronous, choosing the right way to communicate for the right group at the right time.

        It’s possible to represent different types of communication on a spectrum, as seen in the diagram below.

        The spectrum of communication types, with extremes ranging from synchronous, impermanent, and connected to asynchronous, permanent, and disconnected

        Let’s walk the spectrum from left to right—from synchronous to asynchronous—to understand the kinds of choices that we need to make when communicating in a remote environment.

        • Video calls and pair programming are completely synchronous: all participants need to be online at the same time.
        • Chats are written and can be read later, but due to their temporal nature have a short half-life. Usually there’s an expectation that they’ll be read or replied to fairly quickly, else they’re gone.
        • Recorded videos are more asynchronous; however, they’re typically used to broadcast information or complement a longer document, and their relevance can fade rapidly.
        • Email is archival and permanent and is typically used for important communication. People may take many days to reply or not reply at all.
        • Written documents are used for technical designs, in-depth analysis, or cornerstones of projects. They may be read many years after they were written but need to be maintained and often represent a snapshot in time.
        • Wikis and READMEs are completely asynchronous, and if well-maintained, can last and be useful forever.

        Shifting to Asynchronous

        When being Digital by Design, we have to be intentionally more asynchronous. It’s a big relearning of how to work collaboratively. In offices, we could get by synchronously, but there was a catch: colleagues at home, on vacation, or in different offices would have no idea what was going on. Now that we’re all in that position, we have to adapt in order to harness all of the benefits of working with a global workforce.

        By treating everyone as remote, we typically write as a primary form of communication so that all employees can have access to the same information wherever they are. We replace meetings with asynchronous interactions where possible so that staff have more flexibility over their time. We record and rebroadcast town halls so that staff in other timezones can experience the feeling of watching them together. We document our decisions so that others can understand the archeology of codebases and projects. We put effort into editing and maintaining our company-wide documentation in all departments, so that all employees have the same source of truth about teams, the organization chart, and projects.

        This shift is challenging, but it’s worthwhile: effective asynchronous collaboration is how engineers solve hard problems for our merchants at scale, collaborating as part of a global team. Whiteboarding sessions have been replaced with the creation of collaborative documents in tools such as Miro. In-person Town Halls have been replaced with live streamed events that are rebroadcast on different time zones with commentary and interactions taking place in Slack. The information that we all have in our heads has needed to be written, recorded, and documented. Even with all of the tools provided, it requires a total mindset shift to use them effectively.

        We’re continually investing in our developer tools and build systems to enable our engineers to contribute to our codebases and ship to production any time, no matter where they are. We’re also investing in internal learning resources and courses so that new hires can autonomously level up their skills and understand how we ship software. We have regular broadcasts of show and tell and demo sessions so that we can all gather context on what our colleagues are building around the world. And most importantly, we take time to write regular project and mission updates so that everyone in the company can feel the pulse of the organization.

        Asynchronous communication is the great leveler: it connects everyone together and treats everyone equally.

        Permanence in Engineering

        In addition to giving each employee the same window into our culture, asynchronous communication also has the benefit of producing permanent artifacts. These could be written documents, pull requests, emails, or videos. As per our diagram, the more asynchronous the communication, the more permanent the artifact. Therefore, shifting to asynchronous communication means that not only are teams able to be effective remotely, but they also produce archives and audit trails for their work.

        The whole of Shopify uses a single source of truth—an internal archive of information called the Vault. Here, Shopifolk can find all of the information that they need to get their work done: information on teams, projects, the latest news and video streams, internal podcasts and blog posts. Engineers can quickly find architecture diagrams and design documents for active projects.

        When writing design documents for major changes to the codebase, a team produces an archive of their decisions and actions through time. By producing written updates to projects every week, anyone in the company can capture the current context and where it has been derived from. By recording team meetings and making detailed minutes, those that were unable to attend can catch up later on-demand. A shift to asynchronous communication implies a shift to permanence of communication, which is beneficial for discovery, reflection, and understanding.

        For example, when designing new features and architecture, we collaborate asynchronously on design documents via GitHub. New designs are raised as issues in our technical designs repository, which means that all significant changes to our codebase are reviewed, ratified and archived publicly. This mirrors how global collaboration works on the open source projects we know and love. Working so visibly can be intimidating for those that haven’t done it before, so we ensure that we mentor and pair with those that are doing it for the first time.

        Establishing Norms and Boundaries

        Yet, multiple mediums of communication bring many choices in how to use them effectively. When you have the option to communicate via chat, email, collaborative document, or GitHub issue, picking the right one can become overwhelming and frustrating. Therefore we encourage our teams to establish their preferred norms and to write them down. For example:

        • What are the response time expectations within a team for chat versus email?
        • How are online working hours clearly discoverable for each team member?
        • How is consensus reached on important decisions?
        • Is a synchronous meeting ever necessary?
        • What is the correct etiquette for “raising your hand” in video calls?
        • Where are design documents stored so they’re easily accessible in the future?

        By agreeing upon the right medium to use for given situations, teams can work out what’s right for them in a way that supports flexibility, autonomy, and clarity. If you’ve never done this in your team, give it a go. You’ll be surprised how much easier it makes your day to day work.

        The norms that our teams define bridge both synchronous and asynchronous expectations. At Shopify, my team members ensure that they make the most of the windows of overlap that they have each day, setting aside time to be interruptible for pair programming, impromptu chats and meetings, and collaborative design sessions. Conversely, the times of the day when teams have less overlap are equally important. Individuals are encouraged to block out time in the calendar, turn on their “do not disturb” status, and find the space and time to get into a flow state and be productive.

        A natural extension of these communication norms covers writing and shipping code. Given that our global distribution of staff can potentially incur delays when it comes to reviewing, merging, and deploying, teams are encouraged to explore and define how they reach alignment and get things shipped. This ranges from prioritizing reviews of pull requests created in other timezones first thing in the morning, before getting on with your own work, through to finding additional engineers who may be external to the team but in the same timezone to lean in and offer support, review, and pairing.

        Maintaining Connectivity

        Once you get more comfortable with working ever more asynchronously, it can be tempting to want to make everything asynchronous: stand-ups on Slack, planning in Miro, all without needing anyone to be on a video call at all. However, if we look back at the diagram one more time, we’ll see that there’s an important third category: connectivity. Humans are social beings, and feeling connected to other, real humans—not just avatars—is critical to our wellbeing. This means that when shifting to asynchronous work we also need to ensure that we maintain that connection. Sometimes having a synchronous meeting can be a great thing, even if it’s less efficient—the ability to see other faces, and to chat, can’t be taken for granted.

        We actively work to ensure that we remain connected to each other at Shopify. Pair programming is a core part of our engineering culture, and we love using Tuple to solve problems collaboratively, share context about our codebases, and provide an environment to help hundreds of new engineers onboard and gain confidence working together with us.

        We also strongly advocate for plenty of time to get together and have fun. And no, I’m not talking about generic, awkward corporate fun. I’m talking about hanging out with colleagues and throwing things at them in our very own video game: Shopify Party (our internal virtual world for employees to play games or meet up). I’m talking about your team spending Friday afternoon playing board games together remotely. And most importantly, I’m talking about at least twice a year, we encourage teams to come together in spaces we’ve invested in around the world for meaningful and intentional moments for brainstorming, team building, planning, and establishing connections offline.

        Asynchronous brings efficiency, and synchronous brings connectivity. We’ve got both covered at Shopify.

        James Stanier is Director of Engineering at Shopify. He is also the author of Become an Effective Software Manager and Effective Remote Work. He holds a Ph.D. in computer science.

        Wherever you are, your next journey starts here! If building systems from the ground up to solve real-world problems interests you, our Engineering blog has stories about other challenges we have encountered. Intrigued? Visit our Engineering career page to find out about our open positions and learn about Digital by Design.


        Double Entry Transition Tables: How We Track State Changes At Shopify


        Recently we launched Shopify Balance, a money management account and card that gives Shopify merchants quick access to their funds with no fees. After the beta launch of Shopify Balance, the Shopify Data team was brought in to answer the question: how do we reliably count the number of merchants using Balance? In particular, how do we count this historically?

        While this sounds like a simple question, it’s foundationally critical to knowing if our product is a success and if merchants are actually using it. It’s also more complicated than it seems to answer.

        To be considered as using Shopify Balance, a merchant has to have both an active Shopify Balance account and an active Shopify account. This means we needed to build something to track the state changes of both accounts simultaneously, and make that tracking robust and reliable over time. Enter double entry transition tables. While very much an “invest up front and save a ton of time in the long run” strategy, double entry transition tables give us the flexibility to see the individual inputs that cause a given change. It does all of this while simplifying our queries and reducing long term maintenance on our reporting.

        In this post, we’ll explore how we built a data pipeline using double entry transition tables to answer our question: how many Shopify merchants are using Shopify Balance? We’ll go over how we designed something that scales as our product grows in complexity, the benefits of using double entry transition tables—from ease of use to future proofing our reporting—and some sample queries using our new table.

        What Are Double Entry Transition Tables?

        Double entry transition tables are essentially a data presentation format that tracks changes in attributes of entities over time. At Shopify, one of our first uses of a double entry transition table was tracking the state of merchants on the platform, allowing us to report on how many merchants have active accounts. In comparison to a standard transition table that has from and to columns, double entry transition tables output two rows for each state change, along with a new net_change column. They can also combine many individual tracked attributes into a single output.

        It took me a long time to wrap my head around this net_change column, but it essentially works like this: if you want to track the status of something over time, every time the status changes from one state to another, there will be two entries:

        1. net_change = -1: this row is the previous state
        2. net_change = +1: this row is the new state

        Double entry transition tables have many advantages including:

        • The net_change column is additive: this is the true benefit of using this type of table. This allows you to quickly get the number of entities that are in a certain state by summing up net_change while filtering for the state you care about.
        • Identifying cause of change: for situations where you care about an overall status (one that depends on several underlying statuses), you can go into the table and see which of the individual attributes caused the change.
        • Preserving all timing information: the output preserves all timing information, and even correctly orders transitions that have identical timestamps. This is helpful for situations where you need to know something like the duration of a given status.
        • Easily scaled with additional attributes: if the downstream dependencies are written correctly, you can add additional attributes to your table as the product you’re tracking grows in complexity. The bonus is that you don’t have to rewrite any existing SQL or PySpark, all thanks to the additive nature of the net_change column.

        For our purpose of identifying how many merchants are using Shopify Balance, double entry transition tables allow us to track state changes for both the Shopify Balance account and the Shopify account in a single table. It also gives us a clean way to query the status of each entity over time. But how do we do this?

        Building Our Double Entry Transition Pipelines

        First, we need to prepare individual attribute tables to be used as inputs for our double entry transition data infrastructure. We need at least one attribute, but it can scale to any number of attributes as the product we’re tracking grows.

        In our case, we created individual attribute tables for both the Shopify Balance account status and the Shopify account status. An attribute input table must have a specific set of columns:

        • a partition key that’s common across attributes, which in our case is an account_id
        • a sort key, generally a transition_at timestamp and an index
        • an attribute you want to track.

        Using a standard transition table, we can convert it to an attribute with a simple PySpark job:
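        Our actual job uses PySpark (a row_number window partitioned by account_id and transition_at and ordered by transition_id); a pure-Python sketch of the same logic, with illustrative column names, looks like this:

```python
from itertools import groupby
from operator import itemgetter

def to_attribute_rows(transitions):
    """Convert standard transition rows into attribute rows.

    transitions: list of dicts with keys account_id, transition_at,
    transition_id, and to_status. Adds an index column, numbered by
    transition_id, that breaks ties when an account has multiple
    transitions at the same timestamp.
    """
    ordered = sorted(
        transitions,
        key=itemgetter("account_id", "transition_at", "transition_id"),
    )
    rows = []
    for (account_id, transition_at), group in groupby(
        ordered, key=itemgetter("account_id", "transition_at")
    ):
        for index, row in enumerate(group, start=1):
            rows.append({
                "account_id": account_id,
                "transition_at": transition_at,
                "index": index,
                "status": row["to_status"],
            })
    return rows
```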

        Note the index column. We created this index using a row number window function, ordering by the transition_id any time we have duplicate account_id and transition_at sets in our original data. While simple, it serves as a tiebreaker should there be two transition events with identical timestamps. This ensures we always have a unique (account_id, transition_at, index) set in our attribute table for correct ordering of events. The index plays a key role later on when we create our double entry transition table, ensuring we’re able to capture the order of our two states.

        Our Shopify Balance status attribute table showing a merchant that joined and left Shopify Balance.

        Now that we have our two attribute tables, it’s time to feed these into our double entry transition pipelines. This system (called build merge state transitions) takes our individual attribute tables and first generates a combined set of unique rows using a partition_key (in our case, the account_id column), and a sort_key (in our case, the transition_at and index columns). It then creates one column per attribute, and fills in the attribute columns with values from their respective tables, in the order defined by the partition_key and sort_key. Where values are missing, it fills in the table using the previous known value for that attribute. Below you can see two example attributes being merged together and filled in:

        Two example attributes merged into a single output table.

        This table is then run through another process that creates our net_change column and assigns a +1 value to all current rows. It also inserts a second row for each state change with a net_change value of -1. This net_change column now represents the direction of each state change as outlined earlier.

        Thanks to our pipeline, setting up a double entry transition table is a very simple PySpark job:
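        As a pure-Python illustration of what the pipeline does for a single entity (merging attributes, filling values forward, and emitting double-entry rows with defaults), consider the following sketch; our production implementation is PySpark, and these names are illustrative:

```python
def merge_state_transitions(attribute_tables, defaults):
    """Merge per-attribute transitions for one entity into double-entry rows.

    attribute_tables maps attribute name -> list of (sort_key, value)
    pairs; defaults maps attribute name -> the value assumed before the
    attribute's first transition.
    """
    # One combined event per unique sort key, in order.
    events = sorted({key for rows in attribute_tables.values() for key, _ in rows})
    lookup = {name: dict(rows) for name, rows in attribute_tables.items()}

    state = dict(defaults)
    out, prev = [], None
    for key in events:
        # Take each attribute's new value, or carry the last known one forward.
        for name, values in lookup.items():
            if key in values:
                state[name] = values[key]
        row = {"sort_key": key, **state}
        if prev is not None:
            # Close out the previous state at the moment of change.
            out.append({**prev, "sort_key": key, "net_change": -1})
        out.append({**row, "net_change": 1})
        prev = row
    return out
```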

        Note in the code above we’ve specified default values. These are used to fill in the initial null values for the attributes. Now below is the output of our final double entry transition table, which we call our accounts_transition_facts table. The table captures both a merchant’s Shopify and Shopify Balance account statuses over time. Looking at the shopify_status column, we can see they went from inactive to active in 2018, while the balance_status column shows us that they went from not_on_balance to active on March 14, 2021, and subsequently from active to inactive on April 23, 2021:

        A merchant that joined and left Shopify Balance in our accounts_transition_facts double entry transition table.

        Using Double Entry Transition Tables

        Remember how I mentioned that the net_change column is additive? This makes working with double entry transition tables incredibly easy. The ability to sum the net_change column significantly reduces the SQL or PySpark needed to get counts of states. For example, using our new account_transition_facts table, we can identify the total number of active accounts on Shopify Balance, using both the Shopify Balance status and Shopify status. All we have to do is sum our net_change column while filtering for the attribute statuses we care about:
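        In SQL, a query in that spirit might look like the following (table and column names as described above):

```sql
SELECT SUM(net_change) AS active_balance_accounts
FROM account_transition_facts
WHERE shopify_status = 'active'
  AND balance_status = 'active';
```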

        Add in a grouping on a date column and we can see the net change in accounts over time:
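        An illustrative version of that grouped query:

```sql
SELECT DATE(transition_at) AS day,
       SUM(net_change) AS net_account_change
FROM account_transition_facts
WHERE shopify_status = 'active'
  AND balance_status = 'active'
GROUP BY DATE(transition_at)
ORDER BY day;
```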

        We can even use the output in other PySpark jobs. Below is an example of a PySpark job consuming the output of our account_transition_facts table. In this case, we are adding the daily net change in account numbers to an aggregate daily snapshot table for Shopify Balance:
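        A pure-Python sketch of that aggregation step (our production job is PySpark; names here are illustrative):

```python
from collections import defaultdict

def add_net_change_to_snapshot(snapshot, transition_rows):
    """Fold the daily net change in accounts active on both Shopify and
    Balance into a daily snapshot mapping date -> metrics dict."""
    daily = defaultdict(int)
    for row in transition_rows:
        if row["shopify_status"] == "active" and row["balance_status"] == "active":
            daily[row["transition_at"].date()] += row["net_change"]
    for day, change in daily.items():
        snapshot.setdefault(day, {})["net_account_change"] = change
    return snapshot
```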

        There are many ways you can achieve the same outputs using SQL or PySpark, but having a double entry transition table in place significantly simplifies the code at query time. And as mentioned earlier, if you write the code using the additive net_change column, you won’t need to rewrite any SQL or PySpark when you add more attributes to your double entry transition table.

        We won’t lie, it took a lot of time and effort to build the first version of our account_transition_facts table. But thanks to our investment, we now have a reliable way to answer our initial question: how do we count the number of merchants using Balance? It’s easy with our double entry transition table! Grouping by the status we care about, we simply sum net_change and voilà, we have our answer.

        Not only does our double entry transition table simply and elegantly answer our question, but it also easily scales with our product. Thanks to the additive nature of the net_change column, we can add additional attributes without impacting any of our existing reporting. This means this is just the beginning for our account_transition_facts table. In the coming months, we’ll be evaluating other statuses that change over time, and adding in those that make sense for Shopify Balance into our table. Next time you need to reliably count multiple states, try exploring double entry transition tables.

        Justin Pauley is a Data Scientist working on Shopify Balance. Justin has a passion for solving complex problems through data modeling, and is a firm believer that clean data leads to better storytelling. In his spare time he enjoys woodworking, building custom Lego creations, and learning new technologies. Justin can be reached via LinkedIn.

        Are you passionate about data discovery and eager to learn more? We’re always hiring! Reach out to us or apply on our careers page.

        If you’re interested in building solutions from the ground up and would like to come work with us, please check out Shopify’s career page.
