Connecting with Mob Programming

We were a team of six people from three different teams with half of us completely new to the Shipping domain and some of us new to the company. We had six weeks to complete one ambitious project. Most of us had never met in person, let alone worked together. We worked like any other team: picked items off the backlog and worked on them. Everyone said they were open to pairing but very few paired, or if they paired it was with the one person they knew very well. Pull requests (PRs) came in but feedback was scarce. Everyone was new to each other and the domain, so opinions were rare, and when present, weak. PRs had a couple of nits, but otherwise they'd just go through. We may have been shipping, but whether we were growing, connecting, and learning, we were unsure.

Until one day, our teammate, Sheldon Nunes, introduced us to mob programming. Mob programming builds on pair programming but instead of two people pairing, it's an entire mob—more than two people—pairing on the same problem.

What is Mob Programming?

It started innocently enough, as none of us had done mob programming before. Six of us joined a video call and the driver shared their screen. It was ineffective. It wasn’t inclusive or engaging and ended up as pair programming with observers. We had 30 minutes of two individuals contributing, and everyone else had zoned out. Until someone asked, "How do we make sure that everyone doesn't fall asleep?!" Surely enough, mobbing has a solution for that: fast rotations of the driver.

Sheldon, our mob programming expert, suggested we switch to a 10-minute rotation of one driver. At the end of every rotation, the driver pushes the changes, and the next person pulls the changes and takes over. It worked like magic. By taking turns and having a short duration, everyone was forced into engagement or they would be lost on their turn. We made programming a game.

A mob of five people rotating every 10 minutes is 50 minutes of work per rotation. Though the 10 minutes passed quickly, we also moved swiftly and kept tight alignment. The fast rotation also meant that we made decisions quickly—nobody wanted a turn to end without having shipped anything—and every decision was reversible, so it hardly made sense not to be decisive. We saw the same with how much context one shared with the group. There was no risk of a 30-minute context dump by one individual who had high context because the short rotation forced people to share just enough context to get something done. Code reviews also became moot—everyone wrote the code together, so there was little back and forth, allowing us to ship even faster.

The most valuable benefit we saw with mob programming was the strength of our relationships once we started mobbing. It was so effective that we noticed it immediately after the first session. Feedback was easier to give and receive because it wasn't a judgement but a collaboration. While collaborating so closely, we were able to learn from watching each other's unique workflows, laugh at each other’s scribbles and animal drawings, and engage in spontaneous games of tic tac toe.

The Five Lessons of Mob Programming

For three months, the team performed mob programming almost daily. That intensive experience taught us a few things.

1. Pointing and Communicating

Being able to point with a crudely drawn arrow is important. Drawing increases the ways you can interact, changing from verbal only to verbal and visual, but most importantly, it keeps everyone engaged. When mobbing, a 30-second build feels like an eternity, and being able to doodle, or even watch someone else doodle on the screen, changes the engagement level of the group.

We tried one session without drawing and while it can work, it is an exercise in frustration as you try to explain to the driver exactly where to scroll, where on a line the bracket is missing, and where exactly the spelling error is.

2. Discoverability Matters

Our first mobbing session came out of an informal coffee chat. We used Slack's /call feature for pairing so members of the team who weren't in the informal coffee chat could join at a later time. We started this in a private group with a direct message, but faced challenges such as not being able to add "guests" we wanted in our mob who may have had context on what we were trying to solve. A call in a small private group also puts pressure on the whole team to join, irrespective of their availability. So we moved it to a public channel.

An active Slack huddle window that shows the profile photos of the attendees and a Show Call button

A mob that’s discoverable, so people can drop in and drop out, ensures that the mob doesn't "die off" and people can take a break. For us, this means using Slack huddles with screen share and Slack /call in a public channel. Give it a fun name or an obvious name, but keep it public.

3. The Right Problem in the Right Environment

A mob that’s rotating the driver constantly, like ours, requires a problem where people can set up the environment quickly. Have one branch and a simple setup. A single rotation should involve:

git pull
<mob>
git commit -a -m 'WIP!!!'
git push

Yes, the good commit messages get ditched here. It's very possible to end your rotation with debugging statements in code. That's OK. Add a good commit message when a task is complete, not necessarily at every push. This reduces how long a hand off takes and allows rotations to happen without waiting for a "clean exit."
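
As a rough sketch, the rotation above could be wrapped in two small shell functions so that each handoff is a single command. The function names mob_start and mob_handoff are hypothetical, not something the team described using, and the sketch assumes everyone works on the same branch with a remote already configured. It stages with git add -A rather than relying on git commit -a, so that files created during the turn are included in the WIP commit.

```shell
# Hypothetical helpers for one mob rotation; the names are illustrative only.

# Start of a turn: fetch whatever the previous driver pushed.
mob_start() {
    git pull
}

# End of a turn: stage everything (including new files), record a throwaway
# WIP commit, and push so the next driver can pull it.
# Each step runs only if the previous one succeeded.
mob_handoff() {
    git add -A &&
        git commit -m 'WIP!!!' &&
        git push
}
```

Each driver would run mob_start when their turn begins and mob_handoff when the timer ends. If nothing changed during a turn, git commit exits with an error and the push is skipped, which is harmless because there is nothing new to hand over.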

Writing tests (or even this article!) is a poor experience for mobbing. For tests, the runtime is too long to be effective for a mob. These tasks are better as pairing or solo activities, so often someone would volunteer to own the task and take it to completion. For documentation, it's pretty hard to write a sentence together.

4. Invite Other Disciplines

The nature of mob programming means that non-developers can mob with developers. Sometimes it’s Directors who rarely get to code in their day to day or a Product Manager who’s curious. The point is that anyone can mob because the mob is available to help. The driver is expected to not know what to do, and by making that the default experience, mobbing becomes welcoming for developers of all skill levels.

5. Take a Break

Time passes fast in a mob. We found two hours is the maximum length. Mobbing sessions can drain the introverts in the team. Timebox it and set a limit to minimize the feeling of "missing out" for members of the team who are not able to participate.

Remote work changed for all of us permanently that day. Gone were the days of lamenting over the loss of learning through osmosis. In person, we learned from each other by overhearing conversations, but with remote work that quickly went away as conversations moved into private messages or meetings—unless you asked the question, you didn't get to hear the answer. There was no learning new shortcuts and tricks from your coworkers by happening to walk by. However, with mobbing, all of that was back. Arguably pairing should've done this too, but the key with mobbing is that you don't have to ask the questions or give the answers—you can learn from the conversations of others.

An ended Slack huddle window that shows the profile photos of the attendees and the amount of time the huddle lasted.

Before, we were suffering from isolation and feeling disconnected from the team. Now we were over-socialized and had to introduce no-pairing days to give people a chance to recharge. We’re now able to onboard newcomers as mob programming welcomes low context: you have an entire mob to help you, after all.

Swati Swoboda is a Development Manager at Shopify. She has been at Shopify for 3.5 years and currently leads the Shipping Platform team.



A Guide to Running an Engineering Program

In 2020, Shopify celebrated 15 years of coding, building, and changing the global commerce landscape forever. In that time, the team built an enormous amount of tooling, products, features, and an insanely scalable architecture.

With the velocity and pace we’ve been keeping, through growing surface areas that enable commerce for 8% of the world’s adult population and with an evolving architecture to support our rapid rate of development, the platform complexity has increased exponentially. Despite this, we still look hard problems straight in the face and decide that, however complex a problem seems, it's what we want as a business. We can and will do this.

We’ve taken on huge engineering programs in the past that cross over multiple areas of complexity on the platform. These teams deliver features such as the Storefront Renderer, world class performance for events like BFCM, platform wide capacity planning and testing, and efficient tooling for engineers at Shopify like static typing.

Shopify has a playbook for running these engineering programs and we’d like to share it with you.

Defining the Program

All programs require a clearly defined definition of done, or the North Star if you will. This clarity and alignment are absolutely essential to ensure the team is all going in the same direction. For us, a number of documented assets are produced that enable company wide alignment and provide a framework for contextual status updates, risk mitigation, and decisions.

To be aligned is for all stakeholders to agree on the Program Plan in its entirety:

  • the length of time
  • the scope of the Program
  • the staffing assigned
  • the outcomes of the program. 

The program stakeholders are typically Directors and VPs for the area the program will affect. Any technical debt or decisions made along the way are critical for these stakeholders as they inherit it as leaders of that area. The Program Steering Committee includes the Program Stakeholders and the selected Program Lead(s) who together define the program and set the stage for the team to move into action. Here’s how we frame it out:

Problem Statement

Exactly what is the issue? This is necessary for buy-in across executives and organizational leaders. Be clear, be specific. Questions we ask include:

  • What can our users or the company do after this goal is achieved?
  • Why should we solve this problem now?
  • What aren’t we doing in order to prioritize this? 
  • Is that the right tradeoff for the company?

Objectives of Program

Objectives of the program become a critical drum for motivation and negotiation when resources become scarce or other programs are gaining momentum. To come up with good objectives, consider answers to these questions:

  • When we’re done, what’s now possible? 
  • As a company, what do we gain for choosing to invest in this effort over the countless other investments?  
  • What can merchants, developers, and Shopify do after this goal is achieved? 
  • What aren’t we doing in order to prioritize this? 
  • Is that the right tradeoff for the company?

Guiding Principles of the Program

What guiding principles govern the solution? Spend time on this as it’s a critical tool for making tradeoff decisions or forcing constraints in solutions.

Definition of Done

Define exactly what needs to be true for this problem to be solved and what constitutes a passing grade for this definition, both at the program level and for each workstream. The definition of done for the program is specifically what the stakeholders group is responsible for delivering. The definition of done for each workstream is what the program contributors are responsible for delivering. Program Leads are responsible for managing both. It includes:

  • a checklist for each team to complete their contribution
  • a performance baseline 
  • a resiliency plan and gameday
  • complete internal documentation
  • external documentation. 

Defining these expectations early and upfront makes it clear for all contributors and workstream leads what their objective is while also supporting strong parallelization across the team.

Top Risks and Mitigation

What are the top realistic things that could prevent you from attaining the definition of done, and what’s the plan of action to de-risk them? This is an active evolution. As soon as you mitigate one, it's likely another is nipping at your heels. These risks can be and often are technical, but sometimes they’re resource risks if the program doesn’t get the staffing needed to start on the planned schedule.

Path to Done

You know where you are and you know where you need to be. How are you going to get there and, as a result, what’s the timeline? Based on the goals for the program’s definition of done, project scope or staffing become the levers to adjust the plans, looking holistically at the other company initiatives to ensure total company success rather than over optimizing for a single program.

These assets become the framework for aligning, staffing, and executing the program. The Program Stakeholders use them to decide exactly how the Program is executed, what it will achieve, how much staff is needed, and for how long. With this clarity, it becomes a matter of execution, which is one of the most nebulous parts of large, high stakes programs, and there’s no perfect formula. Eisenhower said, “Plans are worthless, but planning is indispensable.” He’s so accurate. It's the implementation of the plan that creates value and gets the company to the North Star.

Executing the Program

Here’s what works for my team of 200 that crosses nine sub-organizations that each have their own senior leadership team, roadmap, and goals.

Within Shopify R&D, we run in six week cycles company wide. The Definition of Done defined as part of the Program Plan acts as the primary reinforcing loop of the program until the goal is attained. That is, have we reached the North Star? Until the answer is yes, we continue working in cycles to deliver on our program. Every new cycle factors in:

  • unachieved goals or regressions from the previous cycle 
  • any new risks that need to be addressed 
  • the goals expected based on the Path to Done. 

Each cycle kicks off with clarity on goals, then moves to get it done or in Shopify terms: GSD (get shit done).

Flow diagram showing the inputs considered to start a cycle and begin GSD to the inputs needed to attain the definition of done and finally implementation
Six week cycle structure

This execution structure takes time to absorb and learn. In practice, it's not this flow of working or even aligning on the Program Plan that’s the hardest part. Rather, it’s all the things that hold the program together along the way. It's how you work within the cycle and what rituals you implement that are critical aspects of keeping huge programs on track. Here’s a look at the rituals my team implements within our 200-person, 92-project program. We have many types of rituals: Weekly Rituals, Cycle Rituals performed every six weeks, and ad hoc rituals.

Weekly Rituals

Company Wide Program Update

Frequency: The Company Wide Program Update is done weekly. We like to do it at the start of the week, reflecting on the previous one.

Why this is important: The update informs stakeholders and others on progress in every active workstream based on our goals. It supports Program Leads by providing an async, detailed pulse of the project.

Trigger: A recurring calendar item that’s scheduled in two events:

  1. Draft the update.
  2. Review the update. 

It holds us accountable to ensure we communicate on a predictable schedule.

Approach: Program Leads have a shared template that adds to the predictability our stakeholders can expect from the update. The template keeps the same format from cycle to cycle and, specifically, uses the same language and layout as the cycle goals presented in the Program Stakeholder Review. A consistent format allows everyone to interpret the plan easily.

We follow the same template every week and update the template cycle to cycle as the goals for the next cycle change. This is prioritized on Monday mornings. The update is a mechanism to dive into things like the team’s risks, blockers, concerns, and celebrations.

Throughout the day, Program Leads collaborate on the document identifying tripwires that signal our ad hoc rituals or unknowns that require reaching out to teams. If the Program Leads are able to review and seek out any missing information by the review meeting, we cancel and get the time back. If not, we have the time and can wrap the document up together.

Deliverable: A weekly communication delivered via email to stakeholders on the status against the current cycle.

Program Lead and Project Lead Status Check-in

Frequency: Lead and Champion Status Check-in is done weekly at the start of the week to ensure the full team is aligned.

Why this is important: Team Leads and Champions complete project updates by end of day Friday to inform the Company Wide Program Update for Monday. This status check-in is dedicated space, if and when we need it, for cycle updates, program updates, or housekeeping.

Trigger: A recurring calendar item.

Approach: The recurring calendar item has all Project Leads on the attendee list. Often, the sync is cancelled as we finish the Company Wide Program Update. If there are a few missing updates, all but those Leads are excused from the sync. By completing the Company Wide Program Update, Program Leads identify which projects are selected for a Program Touch Base ritual.

Deliverable: Accurate status of each project and its likelihood of reaching the defined cycle goal. It informs the weekly Company Wide Program Update.

Escalation Triage

Frequency: Held at a minimum weekly, though mostly ad hoc.

Why this is important: This is how we ensure we’re removing blockers, making decisions, and mitigating risks that could affect our velocity.

Trigger: A recurring calendar item called Weekly Program Lead Sync.

Approach: A GitHub project board is used to manage and track the Program. Tags are used to sort urgency and when things need to be done. Decisions are often the outcome of escalations. These are added to the decision log once key stakeholders are fully aligned.

Escalations are added by Program Leads as they come up allowing us to put them into GitHub with the right classifications to allow for active management. As Program Leads, we tune into all technical designs, project updates, and team demos for many reasons, but one advantage is we can proactively identify escalations or blockers.

Deliverable: Delegate ownership to ensure a solution is prioritized among the program team. The escalations aggregate into relevant items in the Program Stakeholder Review as highlights of blockers or solutions to blockers.

Risk Triage

Frequency: Held at a minimum weekly, though also ad hoc.

Why this is important: This is also how we ensure we’re removing blockers, making decisions and mitigating risks that could affect our velocity. This is how we proactively clear the runway.

Trigger: A recurring calendar item called Weekly Program Lead Sync.

Approach: In our planning spreadsheet, we have a ranking formula to prioritize the risks. This means we’ve identified which risks need mitigation first, where each risk lives within the entire program, and which Lead is assigned a mitigation strategy. We also include a last updated date for the status of the mitigation. This allows us to jump in and out at any cadence without accidentally putting undue pressure on the team to have a strategy immediately or poking them repeatedly. The spreadsheet shows who has had time to develop a mitigation strategy and allows us to monitor its implementation. Once there’s a plan, we update the sheet with the mitigation plan and status of implementation. It’s only once the plan is implemented that we change the ranking and actually mitigate our top risks.

Updating and collaborating is done with comments in the spreadsheet. Between Slack channels and the spreadsheet, you can see progress on these strategies. This is a great opportunity for Program Leads to be proactive and pop in these channels to celebrate and remind the team we just mitigated a big risk. Then, the spreadsheet is updated either by the Team Lead or the Program Lead, depending on who's more excited.

Deliverable: Delegate ownership to ensure a mitigation plan is prioritized among the program team. The escalations aggregate into relevant items in the Program Stakeholder Review as highlights of blockers or solutions to blockers.

Program Lead Sync

Frequency: Held weekly and ad hoc as needed.

Why this is important: This is where the Program Leads strategize, collaborate, and divide and conquer. Program Leads are accountable for the Program’s success. These Leads partner to run the Program and are accountable to deliver on the definition of done by planning cycle after cycle. They must work together and stay highly aligned.

Trigger: A recurring calendar item.

Approach: We have an automatic agenda for this to ensure we tighten our end of week rituals, but also to stay close to the challenges, risks, and wins of the team. We try to minimize our redundancy in coverage. Our agenda starts with four basic bullets:

  • Real Talk: What is on your mind, and what is keeping you up at night? It's how we stay close and ensure we’re partnering and not just coordinated ships in the night.
  • Demo Plan: What messaging, if any, should we deliver during the demo since the entire Program team has joined?
  • Divide and Conquer: What meetings can we drop to reduce redundancy?
  • Risk Review: What are the top risks, and how are the mitigation plans shaping up?

Throughout the week, either Program Lead adds agenda items, which ensures we have a well rounded conversation about the aspects of the Program that are top of mind for the week. Often these items tend to be escalations that could affect the Program velocity or a Project’s Scope.

Deliverable: A communication and messaging plan for weekly demos where the team fully gathers, risk levels, mitigation plans based on time passed, and new information or tooling changes.

Weekly Demos

Frequency: Held weekly.

Why this is important: Weeks one to five are mainly time for the team to share and show off their hard work and contribution to the goals. Week six is to show off the progress on the planned goals to our stakeholders.

Trigger: Scheduled in the calendar for the end of day on Fridays.

Approach: There are two things that happen in prep for this ritual:

  1. planning demo facilitation 
  2. planning demos. 

Planning demos: Any Champion can sign up for a weekly demo. A call for demos is made about two days in advance, and teams indicate their intention on the planning spreadsheet with a weekly check mark if they will demo.

Planning demo facilitation: Program Leads and domain area leadership facilitate the event and deliver announcements, appropriately timed messaging, and demos. Of course, we also have fun and carve out a team vibe. We do jokes and welcome everyone with music. The demos identified are called on one by one to demo, answer team questions and share any milestones achieved.

Deliverable: A recorded session available to the whole company to review and ask further questions. It’s included in the weekly Company Wide Program Updates.

Cycle Rituals Performed Every Six Weeks

Cycle Kick Off

Frequency: Held every new cycle start: day one of week one.

Why this is important: This aligns the team and reminds us what we’re all working towards. We share goals, progress, and welcome new team members or workstreams. It also allows the team to understand what other projects are active in parallel to theirs, allowing them to proactively anticipate changes and collaborate on shared patterns and approaches.

Trigger: A recurring calendar item.

Approach: We host a team sync up, and the entire program team is invited to participate. We try to keep it short, exciting, and inspiring. We raise any reminders on things that have changed, like the definition of done and office hours to help repeat the support in place for the whole team.

Deliverable: A presentation to the team delivered in real-time that highlights the cycle’s investment plan, overall progress on the Program, and some of the biggest areas of risk over the next six weeks for the team.

Mid-Cycle Goal Iteration

Frequency: Held between weeks one and three in a given cycle but no more than once per project.

Why this is important: Goals aren’t always realistic when set, but that’s often realized only after the work starts. Goals aren’t a jail cell; they’re flexible and iterative. Leads are empowered in weeks one to three to change their cycle goal so long as they communicate why and provide a new goal that’s attainable within the remaining time.

Trigger: Week three

Approach: In weeks one to three, Leads leverage Slack to share how their goal is evolving. This evolution and its effect on the subsequent cycles left in the program plan need to be understood. Leads do this as needed; however, in week three there’s a reminder paired with Goal Setting Office Hours.

Deliverable: A detailing of the change in cycle goals since kick off, and its impact on the overall project workstream and program path to be done.

Goal Setting Office Hours

Frequency: Held between weeks three and five in a given cycle.

Why this is important: In week three, time is carved off for reviewing current cycle goals. In week four and week five, the time is focused on next cycle goals. This is how we build a plan for the next cycle’s goals intentionally rather than rushing once the Program Stakeholder meeting is booked. It's how we’re aligned for the week one kick off.

Trigger: Week three

Approach: This is done with a recurring calendar item on the shared program calendar, paired with a sign up sheet. Individuals then add themselves to the calendar invite.

This isn’t a frequently used process, but does give comfort to leads that the support is there and the time is carved off. The Program Touch Base ritual tends to catch risks and velocity changes in advance of Goal Setting Office Hours, but we have yet to determine if they should be removed altogether.

Deliverable: A change in the cycle’s current goal, the overall project workstream, and program path to be done, including staffing changes.

Cycle Report Card

Frequency: Held every six weeks.

Why this is important: This is a moment of gratitude and reflection on what we've achieved, and how we did so together as a team.

Trigger: Week Six

Approach: In week five, Slack reminds Leads to start thinking about this. Over the next week, the team drips in nominations to highlight some of the best wins from the team on performance and values we hold such as being collaborative, merchant/partner/developer obsessed, and resourceful.

This is done in a templated report card where we reflect back on what we wanted to achieve and update the team so they can see the velocity and impact of their work. Then, we celebrate.

This is delivered and facilitated by Program Leads where Team Leads are the ones delivering the praise in a full team sync up. We believe this not only helps create a great team and working environment, but also helps demonstrate alignment among the Program Leads. It helps us all get to know our large team and strengths better.

Deliverable: A celebratory section in the Cycle Kick off presentation reflecting back on individual contributions and team collaborations aligned to the company values.

Program Lead Retro of the Previous Cycle

Frequency: Held every six weeks, skipping the first cycle of the program.

Why this is important: This enables recurring optimization of how the Program is run, the rituals and the team’s health. It ensures that we’re tracking towards a great working experience for staff while balancing critical goals for the company. It’s how Program Leads and Project leads get better at executing the Program, leading the team and managing the Program stakeholders.

Trigger: A new cycle in a program. Typically the retro is held in week one after Project Leads have shared their Retro feedback.

Approach: This retro is facilitated through a stop, start, and continue workshop. It’s a simple, effective way to reflect on recent experiences and decide what should change moving forward. Decisions are based on what we learned in the cycle: what we'll stop doing, start doing, and continue doing.

A few questions are typically added to get the team thinking about aspects of feedback that should be provided:

  • How are Program Leads working together as a team?
  • How are Program Leads managing up to Program Stakeholders?
  • How are Project Leads managing up to Program Leads?
  • What feedback are our Team Leads giving us?
  • How is the execution of the Program going within each team?

This workshop produces a number of lessons that drive change on the current rituals. Starting in week two, the Lead Sync is held to review and discuss how we’re iterating rituals in this cycle. Program Leads aim to implement the changes and communicate to the broader team by the end of week two so we have four weeks of change in place to power the next cycle’s retro.

Deliverable: A documented summary of each aspect of the retro described above available company wide and included in the Program Stakeholder Company Wide Update.

Project Lead Retro of Previous Cycle

Frequency: Held every six weeks, skipping the first cycle of the program.

Why this is important: Project Leads have the option to run the retro as part of their rituals.

This enables recurring optimization of how a Project is run within the Program, the rituals, and the team’s health. It’s how Project Leads get better at executing Projects, leading the team, and working within a larger Program.

Trigger: A new cycle in a program.

Typically the retro is held in week six or week one while the cycle is fresh. Even if the Project Lead has decided not to run a retro, they still may at the request of a Program Lead.

Approach: Project Leads are not prescribed an approach beyond the general Get Shit Done recommendations that already exist within Shopify. The main point of the exercise is not how it's run, but the outcome of the exercise.

Program Leads share an anonymous feedback form in advance of week six. This asks what the team is going to stop, start, and continue, including at the Program level. Then, we include an open ended section to distill lessons learned. These lessons are shared back with all Project Leads so we’re learning from each other. This generates a first team vibe for all Project Leads who have teams contributing to the program. First team is a concept from Patrick Lencioni where true leaders prioritize supporting their fellow leaders over their direct reports.

It’s important for teams who want to go far and fast as this mindset is transformational in creating high performing teams and organizations. This is because a strong foundation of trust and understanding makes it significantly easier for the team to manage change, be vulnerable, and solve problems. At the end of the day, ideas or plans don’t solve problems; teams do.

Deliverable: Iteration plan on the rituals, communication approaches, and tooling that continues to remove as many barriers and as much complexity from the team’s path.

Program Stakeholder Review

Frequency: Held every six weeks, often in early week six.

Why this is important: This is where Program Stakeholders review the goals for the upcoming cycle, set expectations, escalate any risks necessary, or discuss scope changes based on other goals and decisions. This is viewed in context to the cycle ahead, but also the overall Program Plan.

Trigger: Booked by the VP office.

Approach: Program Leads provide a status update of the previous cycle and the goals for the upcoming cycle in visual format. Program Leads leverage the Weekly Sync to make a plan on how we’d like to use this time with the stakeholders so we’re aligned on the most important discussion points and can effectively use the Program Steering Committee's time.

Deliverable: A presentation that highlights progress, the remaining program plan, open decisions, and escalations that require additional support to manage.

Program Stakeholder Company Wide Update

Frequency: We aim to do this at least once a cycle, often at the beginning of week four right in time to clarify the program changes following Mid-Cycle Goal Iteration.

Why is this important: Shopify is a very transparent company internally; it's one of the ways we work that allows us to move so fast. Sharing the Program Status and the evolution cycle to cycle creates an intense collaboration environment, ensuring teams have the right information to spot dependencies and risks as an outcome of this program. It supports Program Leads as well by helping clarify exactly where their team fits in the larger picture by providing an async, detailed pulse of the program.

Trigger: A recurring calendar item that’s scheduled in two events:

  1. Draft the update.
  2. Review the update. 

It holds us accountable to ensure we communicate on a predictable schedule.

Approach: Program Leads have a shared template that adds to the predictability our stakeholders can expect from the update. The template matches the overall program layout, specifically the exact language and layout used when the Program was framed at kickoff. A consistent format allows everyone to interpret the plan easily. The update is a mechanism to dive into things like the program risks, blockers, concerns, and milestones.

Throughout the day, Program Leads collaborate on the document identifying areas that could use more attention and support, highlighting changes to the overall plan, updating forecasting numbers, and most often, celebrating being on track!

Deliverable: A communication delivered via email to stakeholders on the status of the overall program.

Ad Hoc Rituals

The ad hoc rituals are the ones that somehow hold the whole thing together, even through the most turbulent situations. They are the rituals triggered by intuition, experience, and context that pull out the minor technical and operational details that have the potential to significantly affect the scope, trajectory or velocity of the Program. These rituals navigate the known unknowns and unknown unknowns hidden in every project in the program.

Assumption Slam Workshop

Frequency: Held ad hoc, but critical within the first month of kick off.

Why this is important: The nature of these programs is a complex intersection of Product, UX, and Engineering. This is a workshop to align the team and decrease unclear communications or decisions rooted in assumptions. This workshop is a mechanism to surface those assumptions, and the resulting lift to ensure this is well managed and doesn’t become a blocker.

Trigger: Ad hoc

Approach: In weeks one to three the Program Leads facilitate a guided workshop that we call an Assumption Slam. The group should be small as you’ll want to have a meaningful and actionable discussion. The workshop should be facilitated by someone who has enough context on the program to ask questions that lead the team to the root of the assumption and underlying impacts that require management or mitigation. You’ll also want to ensure the right team members are included to ensure you are challenging plans at the right intersections.

Deliverable: The key items identified in this session shift to action items. Mitigate the risk, finalize a decision, or complete a deeper investigation.

Program Touch Base

Frequency: Ad Hoc

Why this is important: This is a conversational sync allowing the Project Lead to flag anything they feel relevant. This is how we stay highly aligned with the Leads and help them stay on course as much as possible.

Trigger: If something doesn’t seem right like:

  • A workstream's goals are off track for more than one week in a row and we haven’t heard from them.
  • A workstream's goals status moves from green to red without being in yellow.
  • A workstream isn’t making their team updates on a regular cadence.
  • A workstream’s Lead hasn't talked with us in a full cycle.

Otherwise, it’s triggered when we have new information that we need to talk about, like another initiative that affects scope or dependencies, or staffing changes.
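To make the trigger list above concrete, here is a minimal sketch of how those checks could be expressed in code. It is an illustration only, not tooling described in this post: the `WorkstreamSnapshot` type, its field names, the reason labels, and the choice to treat any non-green status as "off track" are all assumptions.

```python
from dataclasses import dataclass
from typing import List

CYCLE_WEEKS = 6  # the program described here runs in six-week cycles


@dataclass
class WorkstreamSnapshot:
    """Hypothetical weekly view of one workstream (all field names are illustrative)."""
    name: str
    weekly_status: List[str]        # oldest to newest: "green", "yellow", or "red"
    weeks_since_team_update: int    # weeks since the team last posted an update
    weeks_since_lead_contact: int   # weeks since the Lead last talked with Program Leads
    heard_from_lead: bool = False   # True if the Lead already flagged the off-track status


def touch_base_reasons(ws: WorkstreamSnapshot, update_cadence_weeks: int = 1) -> List[str]:
    """Return the reasons (possibly none) to book a Program Touch Base."""
    reasons = []

    # Count consecutive off-track weeks, newest first.
    # Assumption: any non-green status counts as "off track".
    streak = 0
    for status in reversed(ws.weekly_status):
        if status == "green":
            break
        streak += 1
    if streak > 1 and not ws.heard_from_lead:
        reasons.append("off_track_streak")

    # Status jumped straight from green to red in the latest update.
    if ws.weekly_status[-2:] == ["green", "red"]:
        reasons.append("skipped_yellow")

    # Team updates are not arriving on the expected cadence.
    if ws.weeks_since_team_update > update_cadence_weeks:
        reasons.append("missed_updates")

    # The Lead has not talked with the Program Leads in a full cycle.
    if ws.weeks_since_lead_contact >= CYCLE_WEEKS:
        reasons.append("lead_silent")

    return reasons
```

An empty list means none of the "something doesn't seem right" triggers fired. New information, such as a staffing change, would still be a reason to meet and isn't modelled here.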

Approach: We leverage few here and call the meeting Program Touch Base. Once that’s done, an agenda is automatically added with the following items:

  • Real Talk: What is on your mind and what is keeping you up at night. It's how we stay close and ensure we’re partnering and not just coordinated ships in the night. 
  • Confidence in cycle and full workback: 
    • Based on the goal you have for this cycle, are you confident you can deliver on it? 
    • What about your Full schedule for the program? 
    • What is your confidence in that estimate including time, staffing and scope?
  • What challenges or risks are in your way?
  • Performance: Speed, Scale, Resiliency: 
    • How is the performance of your project shaping up? 
    • Any concerns worth noting that would risk you attaining the definition of done for your workstream?
  • What aren’t you doing? Program stakeholders typically will inherit any technical debt of decisions. By asking this, Project Leads can identify items for the roadmap backlog.

Deliverable: This engagement often leads to action items such as dependency clarification, risk identification and decision making.

Engineering Request for Comments (RFC)

Frequency: Held ad hoc but critical during technical design phases or after performance testing analysis.

Why is this important: Technical Design is good for rapid consensus building. In Engineering Programs, we need to align quickly on small technical areas, rather than align on the entire project direction. There’s significant overlap between the changes being shipped and the design exploration.

Trigger: Ad hoc

Approach: Using an async-friendly approach in GitHub, Engineers follow a template and rules of engagement. If alignment isn’t reached by the deadline date and no one has explicitly “vetoed” the approach, how to proceed becomes the decision of the RFC author.

Deliverable: A technical, architectural decision that is documented.

Performance Testing

Frequency: Held ad hoc, on the component and integrated into the system.

Why is this important: This is critical to the Program and to Shopify. Performance is a core attribute of the product and, at a minimum, can’t regress. It can, however, also be improved, and those improvements are key callouts used in the Cycle Report Card.

Trigger: Deploying to Production.

Approach: Teams design a load test for their component by configuring a shop and writing a Lua script that’s orchestrated through our internal tooling named Genghis. Teams validate the performance against the Service Level Indicators we are optimizing for as part of the program, and if it’s a pass, aim to fold their service into a complete end to end test where the complexity of the system will rear its head.
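The validation step can be pictured with a small sketch. The load test itself is a Lua script run through internal tooling, so the Python below only illustrates the pass or fail check against Service Level Indicators. The `SliTargets` type, the choice of p95 latency and error rate as indicators, and the threshold values are assumptions made for the example, not the program's real SLIs.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class SliTargets:
    """Hypothetical SLI thresholds for one component."""
    p95_latency_ms: float
    max_error_rate: float


def percentile(samples: List[float], pct: int) -> float:
    """Nearest-rank percentile of the samples (pct is a whole number from 1 to 100)."""
    if not samples:
        raise ValueError("no samples recorded")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def evaluate_load_test(latencies_ms: List[float], error_count: int,
                       targets: SliTargets) -> Dict[str, object]:
    """Compare one load test run against the targets.

    Assumes one latency sample was recorded per request, so the number of
    samples is also the total request count.
    """
    p95 = percentile(latencies_ms, 95)
    error_rate = error_count / len(latencies_ms)
    passed = p95 <= targets.p95_latency_ms and error_rate <= targets.max_error_rate
    return {"p95_latency_ms": p95, "error_rate": error_rate, "passed": passed}
```

A run that passes a check like this would be a candidate for folding into the end to end test. A failing run points the team at the indicator that needs work.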

This is done through async discussion as well as office hours hosted by the Program’s performance team. The Performance team documents the context being shared and inherits the end to end testing flows and associated shops. Yes, multiple shops and multiple flows. This is because services are tested at the happy path, but also with the maximum complexity to understand how the system behaves, and what to expect or fix.

Deliverable: First and foremost, it's a feedback loop validating to teams that the service meets the performance expectations. Additionally, the Performance team can now run recurring tests to monitor for regressions and stress on any dimension desired.

Engineering Program Management is still an early craft and evolves to the specific needs of the program, organization of the company, and management structure among the teams involved. We hope a glimpse into how we’ve run a 200+ person engineering program of 90 projects helps you define how your Engineering Program ought to be designed. As you start that journey, remember that not all rituals are necessary. In our experience, though, they’re important for hitting the objectives as closely as possible and doing so with a happy and healthy team. It’s the combination of all of these calendar-based and ad hoc rituals that has allowed Shopify to achieve our goals quarter after quarter.

You heard directly about some of these outcomes at Unite 2021: custom storefronts, checkout app extensions, and Online Store 2.0.

We’d love to hear how you are running engineering programs and how our approaches contrast! Reach out to us on Twitter at @ShopifyEng.

Carla Wright is an Engineering Program Manager with a focus on Scalability. She's been at Shopify for five years working across the organization to guide technical transformation, focusing on product and infrastructure bottlenecks that impact a merchant’s ability to scale and grow their business.



Perspectives on React Native from Three Shopify Developers

By Ash Furrow, AJ Robidas, and Michelle Fernandez

From the perspective of web developers, React Native expands the possibilities of what a developer can create. Using the familiar React paradigm, a web developer can build user interfaces for iOS and Android apps (among other platforms). With React Native, web developers can use familiar tools to solve unfamiliar problems—what a delight!

From the perspective of native developers, both Android and iOS, React Native (RN) helps them build user interfaces much faster. And since native developers typically focus on either Android or iOS (but usually not both), React Native also offers a wider audience for developers to reach with less effort.

As we can see, React Native offers benefits to both web and native mobile developers. Another benefit is that developers get to work together with programmers from other backgrounds. Web, Android, and iOS developers can work together using React Native in a way that they couldn’t before. We know that teams staffed with a variety of backgrounds perform better than monocultures, so the result of using React Native can be better apps, built faster, and for a wider audience. Great!

Even as we see the benefits of React Native, we also need to acknowledge the challenges and concerns from developers of web and native backgrounds. It’s our hope that this blog post (written by a web developer, an Android developer, and an iOS developer) can help contextualize the React Native technology. We hope that by sharing our experiences learning React Native, we can help soothe your anxieties and empower you to explore this exciting technology.

Were You Excited to Start Using React Native?

AJ: Yes, definitely. Coming from a web dev background, I was always interested in mobile development, but had no interest in going back to Java just to make Android apps. I had been using React for a while already so when I heard there was a way to write mobile apps using React I was immediately interested, though I struggled to get into it on my own because I learn better with others. When I was offered a job working with React Native, I jumped at the opportunity.

Michelle: I was hesitant and thought that all that I knew about native Android development was going to be thrown out the window! The easier choice would have been to branch off and stay close to my native development roots doing iOS development, but I’m always up for a challenge and saying YES to new things.

Ash: I wasn’t! My previous team started using it in 2015 and I resisted. I kind of stuck my head in the sand about it because I wanted to use Swift instead. But since the company didn’t need a lot of Swift to be written, I started working on web back-ends and front-ends. That’s when I learned React and got really excited. I finally understood the value in React Native: you get to use React.

What surprised you most about React Native?

AJ: The simplicity of the building blocks. Now I know that sounds a little crazy, but in the web world there are just so many base semantic elements. <button>, <a>, <h1> to <h6>, <p>, <input>, <footer>, <img>, <ol>, etc. So when I started learning React Native, I was looking for the RN equivalents for all of these, but it just isn’t structured the same way. RN doesn’t have heading levels and paragraphs, all text is handled by the <Text> component. Buttons, links, tabs, checkboxes, and more can all be handled with <Touchable> components. Even though the structure of writing a custom component is almost exactly the same as React, it feels very different because the number of semantic building blocks goes from more than 100 down to a little more than 20.

Michelle: I was surprised at how quick it was to build an app! The instant feedback when updating the UI is incomparable to the delay you get with native development, and the data that informs that UI is easy to retrieve using tools like GraphQL and Apollo. I was also very surprised at how painless it was to create the native bridge module and integrate existing SDKs into the app and then using those methods from the JavaScript side. The outcome of it all is a solid cross-platform app that still allows you to use the native layer when you need it! (And it’s especially needed for our Point of Sale app)

Ash: I was surprised by how good a React Native app could be. Previous attempts at cross-platform frameworks, like PhoneGap, always felt like PhoneGap apps. They never felt like they really belonged on the OS. Software written in React Native is hard to distinguish from software written in Swift or Objective-C. I thought that the value proposition of React Native was the ability to write cross-platform apps with a single codebase, but it was only used on iOS during the five years I used it at Artsy. React Native’s declarative model is just a better way to create user interfaces, and I think we’ve seen the industry come to this consensus as SwiftUI and Jetpack Compose play catch-up.

Let’s start by exploring React Native from a web developer’s perspective. React Native uses a lot of the tooling that you, as a web developer, are already familiar with. But it also uses some brand new tools. In some ways, you might feel like you’re starting from scratch. You might struggle with the new tools to accomplish simple tasks, and that’s normal. The discomfort that comes from feeling like a beginner is normal, and it’s mitigated with help from your peers.

Android Studio and Xcode can both be intimidating, even for experienced developers who use them day-to-day. Ideally, your team has enough Android and iOS developers to build solid tooling foundations and to help you when you get stuck. At Shopify, we rarely use the Android Studio and Xcode IDEs to write React Native apps. Instead, we use Visual Studio Code for most of our work. Our mobile tooling teams created command line abstractions for tools like adb, xcodebuild, and xcrun. So you could clone a React Native repository and get a simulator running with the code without ever opening Android Studio or Xcode.

What was the most challenging part about getting used to RN?

AJ: For me it was the uncertainty. I came in confident in my React skills, but I found myself never knowing what mobile specific concerns existed, and when they might come into play. Since everything needs to be run over the RN Bridge, some aspects of web development, like CSS animations for example, just don’t really translate in a way that’s performant enough. So with no mobile development background, any of those mobile specific concerns were a blind spot for me. This is an area where having coworkers from a mobile background has been a huge benefit.

Michelle: Understanding the framework and metro server and node and packages and hooks and state management and and and, so... everything?! Although if you create analogies to native development, you’ll find something similar. One quote I like is: “You’re not starting from scratch, you’re starting from experience.” This helps me to put in perspective that although it’s a new language and framework and the syntax may be different, the semantics are the same. If I wanted to create something like I would using native Android development, I just had to figure out how I could achieve the same thing using a bit of JavaScript (TypeScript) and, if needed, leverage my native development skills and the React bridge to do it.

Ash: I was really sceptical about the node ecosystem, it felt like a house of cards. Having over a thousand dependencies in a fresh React Native project feels… almost irresponsible? At least from a native background in Swift and Objective-C. It’s a different approach to be sure, to work around the unique constraints of JavaScript. I’ve come to appreciate it, but I don’t blame anyone for feeling overwhelmed by the massive amount of tools that your work sits atop.

Your experience as a web developer offers a perspective on how to build user interfaces that’s new to native developers. While you may be familiar with tools like node and yarn, these are very different from the tools that native developers are used to. Your familiarity, from the basics of JSX and React components to your intuition of React best practices and software patterns, will be a huge help to your native developer colleagues.

Offer your guidance and support, and ask questions. Android and iOS developers will tend to use tools they are already familiar with, even if a better cross-platform solution exists. Challenge your teammates to come up with cross-platform abstractions instead of platform-specific implementations.

What did you think would be painful about RN but turned out to be really friendly?

AJ: That's a difficult question for me, I didn’t really have anything in particular that I expected to be painful. I can say that the little bit I tried to learn RN on my own before I started at Shopify, I personally found getting the simulators and emulators set up to be painful. I was grateful when I got started here to find that the onboarding documentation and RN tutorial helped me breeze through the setup way faster than expected. I was up and running with a test app in the simulator within minutes that let me actually start learning RN right away instead of struggling with the tech.

Michelle: Coming from a native background using a powerhouse of an IDE, I thought the development process would slow me down. Thankfully, I’ve got my IDE (IntelliJ IDEA) now set up so that I can write code in React and at the same time write and inspect native Kotlin code. You’d never think that a good search and replace and refactoring system would speed up your dev process by 10x but it really does.

Ash: I was worried that writing JavaScript would be painful, because no one I knew really liked JavaScript. At the time, CoffeeScript was still around, so no one really liked JavaScript, especially iOS developers. But instead, I found that JavaScript had grown a lot since I’d last used it. Furthermore, TypeScript provided a lot of the compile-time advantages of Swift but with a much more humane approach to compilers. I can’t think of a reason I would ever use React Native without TypeScript, it makes the whole experience so much more friendly.

Next, let’s explore the Android and iOS perspectives. Although the Android and iOS platforms are often seen to have an antagonistic relationship with one another, the experiences of Android and iOS developers coming to React Native are remarkably similar. As a native developer, you might feel like you’re turning your back on all the experience you’ve gained so far. But your experience building native applications is going to be a huge benefit to your React Native team! For example, you have a deep familiarity with your platform’s user interface idioms; you should use this intuition to help your team build user interfaces that “feel” like native applications.

What do you wish was better about working in RN?

AJ: Accessibility. I’m a huge accessibility advocate, I push for implementation of accessibility in anything I work on. But this is a bit of a struggle with React Native. Accessibility is an area of RN that doesn’t yet have a lot of educational material out there. A lot of the principles for web still apply, but the correct way to implement accessibility in some areas isn’t yet well established and with fewer semantic building blocks very little gets built in by default. So developers need to be even more aware and intentional about what they create.

Michelle: React Native land seems like the wild wild west after coming from languages with well established patterns and libraries as well as the documentation to support it. These do currently exist for RN but because of how new this framework is and the slow (but increasing!) adoption of it, there's still a long way to go to make it accessible for more people by providing more examples and resources to refer to.

Ash: I wish that the tools were more cohesive. Having worked in the Apple developer ecosystem for so long, I know how empowering a really polished developer tool can be. Not that Apple’s tools are perfect, far from it, but they are cohesive in a way that I miss. There’s usually one way to accomplish a task, but in React Native, I’m often left figuring things out on my own.

React Native apps are still apps and, consequently, they operate under the same conditions as native apps. Mobile devices have specific constraints and capabilities that web developers aren’t used to working with. Native developers are used to thinking about an app’s user experience as more than only the user interface. For example, native developers are keenly aware of how cellular and GPS radios impact battery life. They also know the value of integrating deeply with the operating system, like using rich push notifications or native share sheets. The same skills that help native developers ensure apps are “good citizens” of their platform are still critical to building great React Native applications.

When did your opinions about React Native change?

AJ: I’m not sure I’d say I’ve had a change of opinion. I went into React Native curious and uncertain of what to expect. I’d heard good things from other web devs that I knew who had tried RN. So I felt pretty confident that I’d be able to pick it up and that I would enjoy it. If anything I would say the learning process went even smoother than anticipated.

Michelle: My opinions changed when I found that a React Native app allows us to integrate with the native SDKs we've developed at Shopify and still results in a performant app. I realized that Kotlin over the React bridge works and performs well together and still keeps up my skills in native Android development.

Ash: They changed when I built my first feature from the ground-up in React, for the web. The component model just kind of “clicked” for me. The next time I worked in Swift, everything felt so cumbersome and awkward. I was spending a lot of time writing code that didn’t make the software unique or valuable, it was just boilerplate.

Native developers are also familiar with mobile devices’ native APIs for geofencing, augmented reality, push notifications, and more. All these APIs are accessible to React Native applications, either through existing open source node modules or custom native modules that you can write. It’s your job to help your team make full and appropriate use of the device’s capabilities. A purely React Native app can be good, but it takes collaborating with native developers to make an app that’s really great.

How would you describe your experiences with React Native at Shopify?

AJ: I’ve had a great experience working with React Native at Shopify. I came in as a React dev with absolutely no mobile experience of any kind. I was pointed towards a coworker’s day long “Introduction to React Native” workshop, and it gave me a better understanding than I’d gotten from the self learning I’d attempted previously. On top of that, I have knowledgeable and supportive coworkers that are always willing to take the time out of their day to lend a hand and help fill in the gaps. Additionally the tooling created by the React Native Foundations team takes away the majority of the pain involved in getting started with React Native to begin with.

Michelle: Everything goes at super speed at Shopify—this includes learning React Native! There are so many resources within Shopify to support you including internal workshops providing a crash course to RN. Other teams are also using RN so there’s opportunity to learn from them and the best practices they’re following. Shopify also has specific mobile tooling teams to support RN in our CI environment and automation to ship to production. In addition to the mobile tooling team, there’s a specific React Native Foundations team that builds internal tools to help others get familiar and quickly spin up RN apps. We have monthly mobile team meetups to share and gain visibility into the different mobile projects built across Shopify.

Ash: I’m still very new to the company, but my experience here is really positive so far. There’s a lot of time spent on the foundations of our React Native apps—fast reload, downloadable bundles built for each pull request, lint rules that keep developers from making common mistakes—that all empower developers to move very, very quickly. In React Native, there is no compile step to interrupt a developer’s workflow. We get to develop at the speed of thought. Since Shopify invests so much in developer tooling, getting up to speed with the Shop app took no time at all.

Learning anything new, including RN, can feel intimidating, but you can learn RN. Your existing skills will help you learn, and learning it is best done in a team environment with many perspectives (which we have at Shopify, apply today!).

We see now that both web developers and native developers have different perspectives on building mobile apps with React Native, and how those perspectives complement each other. React Native teams at Shopify are generally staffed with developers from web, Android, and iOS backgrounds because the teams produce the best software when they can draw from these perspectives.

Whether you’re a web developer looking to expand the possibilities of what you can create, or you’re a native developer looking to move faster with a cross-platform technology, React Native can be a great solution. But just like any new skill, learning React Native can be intimidating. The best approach is to learn in a team environment where you can draw from the strengths of your colleagues. And if you’re keen to learn React Native in a supportive, collaborative environment, Shopify is hiring! You can also view a presentation on How We Write React Native Apps at Shopify from our Shipit! Presents series.

AJ Robidas is a developer from Ontario, Canada, with a specialization in accessibility. She has a varied background from C++ and Perl, some Python backend work, to multiple web frameworks (AngularJS, Angular, StencilJS and React). For the past year she has been a React Native developer on the Shop team implementing new and updated experiences for the Shop App.

Michelle Fernandez is a senior software developer from Toronto, Canada with nearly a decade of experience in the mobile applications world. She has been working on Shopify’s Android Point of Sale app since its redesign with Shopify Polaris and has contributed to its rebuild as a React Native app from inception to launch. The Shopify POS app is now in production and used by Shopify merchants around the world.

Ash Furrow is a developer from New Brunswick, Canada, with a decade of experience building native iOS applications. He has written books on software development, has a personal blog, and currently contributes to the Shop team at Shopify.

Shopify-Made Patterns in Our Rails Apps

At Shopify, we’ve developed our own patterns in order to support our global platform. Before coming here, I developed multiple Ruby (and Rails) applications at multiple growth stages. Because of that, I quickly came to appreciate some workarounds and automation that were created to support the large codebase of Shopify.

If there’s something I appreciate about Ruby on Rails, it’s the principle of convention over configuration it’s been built with. This enables junior developers to build higher quality code than in other languages, simply by following conventions. Conventions are also great when moving to a new Rails application: the file structure is always familiar.

But this makes it harder to go outside conventions. When people ask me about the biggest challenges of Ruby, I usually say it’s easy to start, but hard to become an expert. Everything is so abstracted, so one must be really curious and take the time to understand how Ruby and specifically Rails actually work.

Our monolith, Shopify Core, employs many of the common Rails conventions. This ranges from the default application structure, to the usage of in-built libraries like the Active Record ORM, Active Model, or Ruby gems like FrozenRecord.

At Shopify, we implement what most merchants need, most of the time. Similarly, the Rails framework also provides the infrastructure that most developers need, most of the time. Therefore, we had to find creative ways to make the largest Rails monolithic application maintainable.

My goal is for this blog post to be useful to you when you’re ready to join Shopify as a developer, whether you’re new to Ruby or have worked with Ruby on other projects in the past.

Dev

I would like to give the first mention to our command line developer tool, dev. At Shopify, we have thousands of developers working on hundreds of active projects. Many of these projects, in the past, had their own workflows and instructions on setup, how to run tests, and so on.

We created dev to provide us with a unified workflow across a variety of projects. It gives us a way to specify and automate the installation of all the dependencies and includes the workflow items required to boot the project on macOS, from Xcode to bin/rails db:migrate. This is probably the first Shopify-made infrastructure you’ll use when starting at Shopify. It’s easy to take it for granted, but dev is doing so much towards increasing our productivity.

Time is money and automations are one-time efforts.

We believe consistency is important across development environments. Inconsistencies can lead to debugging nightmares and incorrect local behaviour. Even with the existing tools like chruby, bundler, and homebrew to manage dependencies, setup can be a multi-step tedious process, and it can be difficult to outline the processes that achieve the desired consistency. So, we standardise many of the commands we use at Shopify through dev.

One of the most powerful features of dev is the ability to spin up services, in multiple programming languages. That means each repo has the same base configuration, structure, and libraries. Our infrastructure team is constantly working to make dev better to ultimately increase developer productivity. Dev also abstracts environment variables. Whenever joining smaller companies, one would spend days “fishing” environment variables before getting a few connected systems up and running.

Dev also lets Shopify developers turn integrations with interconnected services on and off. Without dev, this is usually changed manually through environment variables or configuration types.

Lastly, dev even abstracts command aliases! Ruby is already pretty good on commands, but when looking at tools, the commands can get super long. This is where aliases help us developers save time, as we can make shortcuts for longer commands. So Shopify took this to the next level: why make developers set up their own environment if they can get a baseline configuration right through dev? This also helps standardise commands across projects, regardless of the programming language. For example, I used to use the Hub package for opening PRs. Now, I just use dev open pr.

Pods

Shopify Core has a podded architecture, which means that the database is split into an arbitrary number of subsets, each containing a distinct subset of shops. Each pod runs Shopify independently, with a database containing a portion of our shops. The concept is based on the shard database infrastructure pattern. The Rails framework already has the pod/shard structure built in. It was implemented with Shopify’s usage in mind and in collaboration with GitHub. Compared with the shard database pattern, we’re expanding it to the full infrastructure. That includes provisioning, deployment, load balancers, caching, and servers. If one pod shuts down temporarily, the other pods aren’t affected. If you’d like to learn more about the infrastructure behind this, check out our blog post about running Kafka on Kubernetes at Shopify.

Horizontally scaling out our monolith was the fastest solution to handling our load.

Shopify is not just a software as a service company. It’s a platform able to generate full websites for over 1.7 million merchants. Whenever we deliver our services to merchants, we look at data in the context of the merchant's store. And that’s why we split everything by shop, including:

  • Incoming HTTP requests
  • Background jobs
  • Asynchronous events

That’s why every table in a podded database is connected to a shop. The shop is necessary for podding—our solution for horizontal scaling. And the link helps us avoid having data leaks between shops.
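To make the idea concrete, here’s a minimal sketch in Ruby. It isn’t how Shopify Core actually routes work. It only illustrates that the shop ID alone is enough to decide which pod handles a request, job, or event. The names PodRouter, assign, and pod_for are hypothetical.

```ruby
# Hypothetical sketch: every unit of work carries a shop ID, and the shop ID
# alone decides which pod (database subset) serves it.
class PodRouter
  class UnknownShop < StandardError; end

  def initialize
    @pod_by_shop = {} # shop_id => pod_id
  end

  # Record which pod holds a shop's data.
  def assign(shop_id, pod_id)
    @pod_by_shop[shop_id] = pod_id
  end

  # Look up the pod for a request, job, or event scoped to a shop.
  def pod_for(shop_id)
    @pod_by_shop.fetch(shop_id) { raise UnknownShop, "no pod for shop #{shop_id}" }
  end
end

router = PodRouter.new
router.assign(1, :pod_a)
router.assign(2, :pod_b)
router.pod_for(2) # => :pod_b
```

Because the lookup fails loudly for an unknown shop, work that isn’t tied to a shop can’t silently land on the wrong pod.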

For a more detailed overview of pods, check out A Pods Architecture to Allow Shopify to Scale.

Domain Driven Design

At Shopify, we love monoliths. The same way microservices have their challenges, so do monoliths, but these are challenges we're excited to try and solve.

Splitting concerns became a necessity to support delivery in our growing organization.

Monoliths can serve our business purpose very well—if they aren’t a mess. And this is where domain driven architecture comes into place. This concept wasn’t invented by Shopify, but it was definitely tweaked to work in our domain. If you’d like to learn more about how we deconstructed our monolith through components, check out Deconstructing the Monolith: Designing Software that Maximizes Developer Productivity and Under Deconstruction: The State of Shopify’s Monolith.

We did split our code into domains, but that’s about all we split. Traditionally, we’d see no link between domains besides public or internal APIs. But our database is still common for all domains, and everything is still linked to the Shop. This means we’re breaking domain boundaries every time we call Shop from another domain. As mentioned earlier, this is a necessity for our podded architecture. This is where it becomes trickier: every time we’re instantiating a model outside our domain, we’re ignoring component boundaries and we receive a warning for it. But, because the shop is already part of every table, the shop is practically part of every domain.

Something else you may be surprised by is we don’t enforce any relationships between tables on the database layer. This means the foreign keys are enforced only at the code level through models.

And, even though we use Active Record migrations (not split by pods), running all historical migrations wouldn’t be feasible. Because of that, we only use migrations in the short term. Every month or so, we merge our migrations into a raw SQL file which holds our database structure. This saves the platform from running migrations for hours, dating back 10 years. This blog post, Pros and Cons of Using structure.sql in Your Ruby on Rails Application, explains in more detail the benefits of using a structure.sql file.

Standardizing How We Write Code

We expect to hire over 2,000 people this year. How can we control the quality of the code written? We do it by detecting repetitive mistakes. Shopify has created many systems to address this, ranging from gems to generators.

We built safeguards to keep quality levels up in a fast scaling organization.

One of the tools often used that’s implemented by us is the translation platform: a system handling creation, translation, and publication of translations directly through git.

In smaller companies, you’d just receive translations from the marketing team to embed in the app, or just get it through a CRM. This is certainly not enough when it comes to globalizing such a large application. The goal is to enable anyone to release their work while translations are being handled asynchronously, and it definitely saves us a lot of time. All we need to do is push the English version, and all the strings are automatically sent to a third party system where translators can add their translations. Without any input from the developers, the translations are directly committed back in our repos. The idea was first developed during Shopify hack days back in 2017. To learn more, check out this blog post about our translation platform.

Our maintenance task system also deserves an honorable mention. It’s built on top of the Rails Active Job library, but has been adapted to work with our podded infrastructure. In a nutshell, it’s a Rails engine for queuing and managing maintenance tasks. In case you’d like to look into it, we’ve made this project open source.

In our monolith, we’ve also set up tons of automatic tests letting us know when we’re taking the wrong approach, and limits were put in to avoid overloading our system when spawning jobs.

Another system that standardizes how we do things is Monorail. Initially inspired by Airbnb Jitney, Monorail enforces schemas for widely used events. It creates contracts between Kafka producers and consumers through a defined structure of the data sent through JSON. Some benefits are:

  1. With unstructured events, events with different structure would end up as part of the same data warehouse table. Monorail creates a contract between developers and data scientists through schemas. If it changes, it has to be done through versioning.
  2. It also avoids Personally Identifiable Information (PII) leaks. Schemas all go through privacy review, ensuring PII fields are annotated as such, and are automatically scrubbed (redacted, tokenized).
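As a rough illustration of both benefits, here’s a small Ruby sketch of a versioned event schema with PII annotations. This is not Monorail’s API. EventSchema, validate!, scrub, and the checkout_completed event are all made up for the example.

```ruby
# Hypothetical sketch of a versioned event schema with PII annotations.
# This is not Monorail's API; all names here are made up for illustration.
EventSchema = Struct.new(:name, :version, :fields, :pii_fields) do
  # Reject events whose keys don't match the contract exactly.
  def validate!(event)
    missing = fields - event.keys
    extra = event.keys - fields
    return event if missing.empty? && extra.empty?

    raise ArgumentError, "#{name} v#{version}: missing #{missing}, unexpected #{extra}"
  end

  # Replace annotated PII values before the event reaches the warehouse.
  def scrub(event)
    validate!(event)
    event.to_h { |key, value| [key, pii_fields.include?(key) ? "[REDACTED]" : value] }
  end
end

checkout_v1 = EventSchema.new("checkout_completed", 1, %i[shop_id total email], %i[email])
clean = checkout_v1.scrub({ shop_id: 42, total: 19.99, email: "buyer@example.com" })
# clean[:email] is now "[REDACTED]"; shop_id and total are unchanged
```

Changing the field list would mean defining a new schema with a higher version number, which is the versioning contract described above.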

I’ve covered many different topics here in this introduction to all of the awesome features we’ve set up to increase our productivity levels and focus on what matters: shipping great features. If you decide to join us, this overview should give you enough background to help you take the right approach at Shopify from the beginning.

Ioana Surdu-Bob is a Developer at Shopify, working on the Shopify Payments team. She’s passionate about personal finance and investing. She’s trying to help everyone build for financial independence through Konvi, a crowdfunding platform for alternative assets.




Shopify's Path to a Faster Trino Query Execution: Infrastructure

By Matt Bruce & Bruno Deszczynski

Driving down the amount of time data scientists are waiting for query results is a critical focus (and necessity) for every company with a large data lake. However, handling and analyzing high-volume data within split seconds is complicated. One of the biggest hurdles to speed is whether you have the proper infrastructure in place to efficiently store and query your data.

At Shopify, we use Trino to provide our data scientists with quick access to our data lake, via an industry standard SQL interface that joins and aggregates data across heterogeneous data sources. However, our data has scaled to the point where we’re handling 15 Gbps and over 300 million rows of data per second. With this volume, greater pressure was put on our Trino infrastructure, leading to slower query execution times and operational problems. We’ll discuss how we scaled our interactive query infrastructure to handle the rapid growth of our datasets, while enabling a query execution time of less than five seconds.

Our Interactive Query Infrastructure 

At Shopify, we use Trino and multiple client apps as our main interactive query tooling, where the client apps are the interface and Trino is the query engine. Trino is a distributed SQL query engine. It’s designed to query large data sets distributed over heterogeneous data sources. The main reason we chose Trino is that it gives us flexibility in which database engines we use. However, it’s important to note that Trino isn’t a database itself, as it lacks a storage component. Rather, it’s optimized to perform queries across one or more large data sources.

Our architecture consists of two main Trino clusters:

  • Scheduled cluster: runs reports from Interactive Analytics apps configured on a fixed schedule.
  • Adhoc cluster: runs any on-demand queries and reports, including queries from our experiments platform.

We use a fork of Lyft’s Trino Gateway to route queries to the appropriate cluster by inspecting header information in the query. Each of the Trino clusters runs on top of Kubernetes (Google GKE) which allows us to scale the clusters and perform blue-green deployments easily.

While our Trino deployment managed to process an admirable amount of data, our users had to deal with inconsistent query times depending on the load of the cluster, and occasionally situations where the cluster became so bogged down that almost no queries could complete. We had to get to work to identify what was causing these slow queries, and speed up Trino for our users.

The Problem 

When it comes to querying data, Shopify data scientists (rightfully) expect to get results within seconds. However, we encounter scenarios like interactive analytics, A/B testing (experiments), and reporting all in one place. In order to improve our query execution times, we focused on speeding up Trino, as that’s where most of the optimization potential lies for the final performance of queries executed via any SQL client software.

We wanted to achieve a query latency of P95 less than five seconds, which would be a significant decrease (approximately 30 times). That was a very ambitious target as approximately five percent of our queries were running around one to five minutes. To achieve this we started by analyzing these factors:

  • Query volumes
  • Most often queried datasets
  • Queries consuming most CPU wall time
  • Datasets that are consuming the most resources
  • Failure scenarios
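For readers less familiar with the metric, P95 is the execution time that 95 percent of queries finish within. A 30-fold decrease to five seconds implies a starting P95 of roughly 150 seconds. The Ruby sketch below computes a nearest-rank percentile over made-up durations (not our real data) to show how a small share of slow queries sets the P95.

```ruby
# Nearest-rank percentile: the smallest value such that at least `pct` percent
# of samples are less than or equal to it.
def percentile(samples, pct)
  raise ArgumentError, "no samples" if samples.empty?

  sorted = samples.sort
  rank = (pct * sorted.size / 100.0).ceil
  sorted[[rank, 1].max - 1]
end

# Made-up execution times in seconds: 94 fast queries and 6 slow ones.
durations = Array.new(94) { 2.0 } + [60.0, 90.0, 120.0, 150.0, 200.0, 300.0]
p50 = percentile(durations, 50) # => 2.0
p95 = percentile(durations, 95) # => 60.0
```

Even though the median query here takes two seconds, the P95 lands at a full minute, which is why we analyzed the heaviest queries and datasets rather than the average.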

When analyzing the factors above, we discovered that it’s not necessarily the query volume itself that was driving our performance problems. We noticed a correlation between certain types of queries and datasets consuming the most resources that was creating a lot of error scenarios for us. So we decided to zoom in and look into the errors.

We started looking at error classes in particular:

A dashboard showing 0.44% Query Execution Failure rate and a 0.35% Highly relevant error rate. The dashboard includes a breakdown of the types of Presto errors.
Trino failure types breakdown

It can be observed that our resource relevant error rate (related to exceeding resource use) was around 0.35 percent, which was acceptable due to the load profile that was executed against Trino. What was most interesting for us was the ability to identify the queries that were timing out or causing a degradation in the performance of our Trino cluster. At first it was hard for us to properly debug our load specific problems, as we couldn’t recreate the state of Trino during the performance degradation scenarios. So, we created a Trino Query Replicator that allowed us to recreate any load from the past.

Recreating the state of Trino during performance degradation scenarios enabled us to drill down deeper on the classes of errors, and identify that the majority of our problems were related to:

  • Storage type: especially compressed JSON format of messages coming from Kafka.
  • Cluster Classes: using the ad-hoc server for everything, and not just what was scheduled.
  • CPU & Memory allocation: both on the coordinator and workers. We needed to scale up together with the number of queries and data.
  • JVM settings: we needed to tune our virtual machine options.
  • Dataset statistics: allowing for better query execution via cost based optimization available in Trino.

While we could write a full book diving into each problem, for this post we’ll focus on how we addressed problems related to JVM settings, CPU and Memory allocation, and cluster classes.

A line graph showing the P95 execution time over the month of December. The trend line shows that execution time was steadily increasing.
Our P95 Execution time and trend line charts before we fine tuned our infrastructure

The Solution

In order to improve Trino query execution times and reduce the number of errors caused by timeouts and insufficient resources, we first tried to “money scale” the current setup. By “money scale” we mean we scaled our infrastructure horizontally and vertically. We doubled the size of our worker pods to 61 cores and 220GB memory, while also increasing the number of workers we were running. Unfortunately, this alone didn’t yield stable results. For that reason, we dug deeper into the query execution logs, stack traces, and Trino codebase, and consulted the Trino creators. From this exploration, we discovered that we could try the following:

  • Creating separate clusters for applications with predictable heavy compute requirements.
  • Lowering the number of concurrent queries to reduce coordinator lock contention.
  • Ensuring the recommended JVM recompilation settings are applied.
  • Limiting the maximum number of drivers per query task to prevent compute starvation.

Workload Specific Clusters

As outlined above, we initially had two Trino clusters: a Scheduled cluster and an Adhoc cluster. Sharing one cluster between users’ ad hoc queries and the experiment queries was causing frustration on both sides. The experiment queries added a lot of excess load, making users’ query times inconsistent. A query that might take seconds to run could take minutes if there were experiment queries running. Correspondingly, the users’ queries made the runtime of the experiment queries unpredictable. To make Trino better for everyone, we added a new cluster just for the experiment queries, leveraging our existing deployment of Trino Gateway to route experiment queries there based on an HTTP header.
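Header-based routing of this kind can be sketched in a few lines of Ruby. The header name X-Query-Source and the cluster names are assumptions made for the example. They aren’t the Trino Gateway’s actual configuration.

```ruby
# Hypothetical sketch of header-based routing; the header name and cluster
# names are assumptions, not the Trino Gateway's actual configuration.
ROUTING_HEADER = "X-Query-Source"

CLUSTER_BY_SOURCE = {
  "scheduled" => :scheduled_cluster,
  "experiments" => :experiments_cluster
}.freeze

# Anything without a recognized source falls through to the ad hoc cluster.
def route(headers)
  CLUSTER_BY_SOURCE.fetch(headers[ROUTING_HEADER], :adhoc_cluster)
end

route({ "X-Query-Source" => "experiments" }) # => :experiments_cluster
route({})                                    # => :adhoc_cluster
```

Defaulting to the ad hoc cluster means a client that sends no header keeps working, while heavy, predictable workloads opt in to their own cluster.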

We also took this opportunity to write some tooling that allows users to create their own ephemeral clusters for temporary heavy-duty processing, or investigations with a single command (these are torn down automatically by an Airflow job after a defined TTL).

A system diagram showing the Trino infrastructure before changes. Mode and internal SQL clients feed into the Trino Gateway. The Gateway feeds into scheduled reports and adhoc queries.
Trino infrastructure before
A system diagram of the Trino infrastructure after changes. Mode and internal clients both feed into the Trino Gateway. The Gateway feeds into Scheduled Reports, Ad hoc queries, and experimental queries. In addition, the Internal SQL clients feed into Short-Term clusters
Trino infrastructure after

Lock Contention

After exhausting the conventional scaling up options, we moved on to the most urgent problem: when the Trino cluster overloaded and work wasn’t progressing, what was happening? By analyzing metrics output to Datadog, we were able to identify a few situations that would arise.

One problem we identified was that the Trino cluster’s queued work would continue to increase, but no queries or splits were being dispatched. In this situation, we noticed that the Trino coordinator (the server that handles incoming queries) was running, but it stopped outputting metrics for minutes at a time. We originally assumed that this was due to CPU load on the coordinator (those metrics were also unavailable). However, after logging into the coordinator’s host and looking at the CPU usage, we saw that the coordinator wasn’t busy enough to explain its failure to report statistics. We proceeded to capture and analyze multiple stack traces and determined that the issue wasn’t an overloaded CPU, but lock contention against the Internal Resource Group object from all the active queries and tasks.

We set hardConcurrencyLimit to 60 in our root resource group to limit the number of running parallel queries and reduce the lock contention on the coordinator.

"rootGroups": [
    {
    "hardConcurrencyLimit": "60",

Resource group configuration

This setting is a bit of a balancing act between allowing enough queries to run to fully utilize the cluster, and capping the amount running to limit the lock contention on the coordinator.
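Conceptually, a hard concurrency limit behaves like the gate sketched below in Ruby: up to a fixed number of queries run, and the rest wait their turn. This is only an illustration of the trade-off, not Trino’s resource group implementation, and ConcurrencyGate is a hypothetical name.

```ruby
# Hypothetical sketch of a hard concurrency limit: at most `limit` queries run
# at once, and the rest wait in a FIFO queue until a slot frees up.
class ConcurrencyGate
  attr_reader :running, :queued

  def initialize(limit)
    @limit = limit
    @running = []
    @queued = []
  end

  # Start the query if there is a free slot, otherwise queue it.
  def submit(query_id)
    if @running.size < @limit
      @running << query_id
    else
      @queued << query_id
    end
  end

  # Free a slot and promote the oldest queued query, if any.
  def finish(query_id)
    return unless @running.delete(query_id)

    @running << @queued.shift unless @queued.empty?
  end
end

gate = ConcurrencyGate.new(2)
%w[q1 q2 q3].each { |id| gate.submit(id) }
# q1 and q2 run; q3 waits
gate.finish("q1")
# q3 is promoted into the freed slot
```

A lower limit means fewer queries compete for shared coordinator state at once, at the cost of more time spent waiting in the queue.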

A line graph showing Java Lang:System CPU load in percent over a period of 5 hours before the change. The graph highlights the spikes where metric dropouts happened six times.
Pre-change CPU graph showing Metrics dropouts due to lock contention
A line graph showing Java Lang:System CPU load in percent over a period of 5 hours after the change. The graph highlights there were no more metric dropouts.
Post change CPU graph showing no metrics dropouts

JVM Recompilation Settings

After the coordinator lock contention was reduced, we noticed that we would have a reasonable number of running queries, but the cluster throughput would still be lower than expected. This caused queries to eventually start queuing up. Datadog metrics showed that a single worker’s CPU was running at 100%, but most of the others were basically idle.

A line graph showing Java Lang:System CPU load by worker in percent over a period of 5 hours. It highlights that a single worker's CPU was running at 100%
CPU Load distribution by worker

We investigated this behaviour by doing some profiling of the Trino process with jvisualvm while the issue was occurring. What we found was that almost all the CPU time was spent either: 

  1. Doing GCM AES decryption of the data coming from GCS.
  2. JSON deserialization of that data.

What was curious to us is that the datasets the affected workers were processing were no different from those of any of the other workers. Why were these using more CPU time to do the same work?

After some trial and error, we found setting the following JVM options prevented our users from being put in this state:

-XX:PerMethodRecompilationCutoff=10000
-XX:PerBytecodeRecompilationCutoff=10000

JVM settings

It’s worth noting that these settings were added to the recommended JVM options in a later version of Trino than we were running at the time. There’s a good discussion about those settings in the Trino GitHub repo! It seems that we were hitting a condition that was causing the JVM to no longer attempt compilation of some methods, which caused them to run in the JVM interpreter rather than as compiled code, which is much, much slower.

In the graph below, the CPU of the workers is more aligned without the ‘long tail’ of the single worker running at 100 percent.

CPU Load distribution by worker

Amount of Splits Per Stage Per Worker

In the process of investigating the performance of queries, we happened to come across an interesting query via the Trino Web UI:

A screenshot showing the details displayed on the Trino WebUI. It includes query name, execution time, size, etc.
Trino WebUI query details

What we found was one query had a massive number of running splits: approximately 29,000. This was interesting because, at that time, our cluster only had 18,000 available worker threads, and our Datadog graphs showed a maximum of 18,000 concurrent running splits. We’ll chalk that up to an artifact of the WebUI. Doing some testing with this query, we discovered that a single query could monopolize the entire Trino cluster, starving out all the other queries.

After hunting around the Slack and forum archives, we came across an undocumented configuration option: `task.max-drivers-per-task`. This configuration enabled us to limit the maximum number of splits that can be scheduled per stage, per query, per worker. We set this to 16, which limited this query to around 7,200 active splits.
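A quick back-of-the-envelope check in Ruby, using only the figures above, shows what the cap implies. The 450 tasks and the 40 percent share are inferred from those figures and aren’t measured values.

```ruby
# Back-of-the-envelope check using the figures quoted above.
max_drivers_per_task = 16
observed_active_splits = 7_200
worker_threads = 18_000

# Number of tasks (one stage of the query on one worker) implied by the cap.
implied_tasks = observed_active_splits / max_drivers_per_task # => 450

# Share of the cluster's worker threads the capped query can occupy.
cluster_share = observed_active_splits.to_f / worker_threads  # => 0.4
```

In other words, the same query that could previously saturate all 18,000 worker threads is now held to less than half of them, leaving room for everyone else.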

The Results and What’s Next

Without leveraging the storage upgrade and by tapping into cluster node sizing, cluster classes, Trino configs, and JVM tuning, we managed to bring down our execution latency to 30 seconds and provide a stable environment for our users. The below charts present the final outcome:

A bar graph showing the large decrease in execution time before the change and after the change.
Using log scale binned results for execution time before and after
A line graph showing the P95 execution time over a 3 month period.  The trend line shows that execution time reduces.
P95 Execution time and trendline over 3 month period

The change in the distribution of queries across the bins shows that we managed to move more queries into the zero to five second bucket and (most importantly) limited how long the heaviest queries ran. Our execution time trendline speaks for itself, and as we’re writing this blog, our P95 query execution time is below 30 seconds.

By creating separate clusters, lowering the number of concurrent queries, ensuring the recommended JVM recompilation settings were applied, and limiting the maximum number of drivers per query task, we were able to scale our interactive query infrastructure.

While addressing the infrastructure was an important step to speed up our query execution, it’s not our only step. We still think there is room for improvement and are working to make Trino our primary interactive query engine. We’re planning to put further efforts into:

  • Making our storage more performant (JSON -> Parquet).
  • Introducing an Alluxio cache layer.
  • Creating load profiling tooling.
  • Enhancing our statistics to improve the ability of the Trino query optimizer to choose the optimal query execution strategy, not just the overall performance of user queries.
  • Improving our Trino Gateway by rolling out Shopify Trino Conductor (a Shopify specific gateway), improving UI/infrastructure, and introducing weighted query routing.

Matt Bruce: Matt is a four-year veteran at Shopify serving as Staff Data Developer for the Foundations and Orchestration team. He’s previously helped launch many open source projects in Shopify including Apache Druid and Apache Airflow, as well as migrating Shopify’s Hadoop and Presto infrastructure from physical data centers into cloud-based services.

Bruno Deszczynski: Bruno is a Data Platform EPM working with the Foundations team. He is obsessed with making Trino execute interactive analytics queries (P95) below five seconds in Shopify.


High Availability by Offloading Work Into the Background

Unpredictable traffic spikes, slow requests to a third-party payment gateway, or time-consuming image processing can easily overwhelm an application, making it respond slowly or not at all. Over Black Friday Cyber Monday (BFCM) 2021, Shopify merchants made sales of over 5 billion USD, with peak sales of over 100 million USD per hour. On such a massive scale, high availability and short response times are crucial. But even for smaller applications, availability and response times are important for a great user experience.

BFCM by the numbers globally 2020: Total Sales: $5.1B USD, Average cart prices $89.20 USD, Peak sales hour $102M+ at 12 pm EST, 50% more consumers buying from Shopify Merchant
BFCM by the numbers

High Availability

High availability is often conflated with a high server uptime. But it’s not sufficient that the server hasn’t crashed or shut down. In the case of Shopify, our merchants need to be able to make sales. So a buyer needs to be able to interact with the application. A banner saying “come back later” isn’t sufficient, and serving only one buyer at a time isn’t good enough either. To consider an application available, the community of users needs to have meaningful interactions with the application. Availability can be considered high if these interactions are possible whenever the users need them to be.

Offloading Work

In order to be available, the application needs to be able to accept incoming requests. If the external-facing part of the application (the application server) is also doing the heavy lifting required to process the requests, it can quickly become overwhelmed and unavailable for new incoming requests. To avoid this, we can offload some of the heavy lifting into a different part of the system, moving it outside of the main request response cycle to not impact the application server’s availability to accept and serve incoming requests. This also shortens response times, providing a better user experience.

Commonly offloaded tasks include:
  • sending emails
  • processing images and videos
  • firing webhooks or making third party requests
  • rebuilding search indexes
  • importing large data sets
  • cleaning up stale data

The benefits of offloading a task are particularly large if the task is slow, consumes a lot of resources, or is error-prone.

For example, when a new user signs up for a web application, the application usually creates a new account and sends them a welcome email. Sending the email is not required for the account to be usable, and the user doesn’t receive the email immediately anyway. So there’s no point in sending the email from within the request response cycle. The user shouldn’t have to wait for the email to be sent. They should be able to start using the application right away, and the application server shouldn’t be burdened with the task of sending the email.

Any task not required to be completed before serving a response to the caller is a candidate for offloading. When uploading an image to a web application, the image file needs to be processed and the application might want to create thumbnails in different sizes. Successful image processing is often not required for the user to proceed, so it’s generally possible to offload this task. However, the server can no longer respond, saying “the image has been successfully processed” or “an image processing error has occurred.” Now, all it can respond with is “the image has been uploaded successfully, it will appear on the website later if things go as planned.” Given the very time-consuming nature of image processing, this trade-off is often well worth it, given the massive improvement of response time and the benefits of availability it provides.

Background Jobs

Background jobs are an approach to offloading work. A background job is a task to be done later, outside of the request response cycle. The application server delegates the task, for example, the image processing, to a worker process, which might even be running on an entirely different machine. The application server needs to communicate the task to the worker. The worker might be busy and unable to take on the task right away, but the application server shouldn’t have to wait for a response from the worker. Placing a message queue between the application server and worker solves this dilemma, making their communication asynchronous. Sender and receiver of messages can interact with the queue independently at different points in time. The application server places a message into the queue and moves on, immediately becoming available to accept more incoming requests. The message is the task to be done by the worker, which is why such a message queue is often called a task queue. The worker can process messages from the queue at its own speed. A background job backend is essentially some task queues along with some broker code for managing the workers.
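Here’s a minimal sketch of that flow in Ruby, using the built-in thread-safe Queue as the task queue and a thread as a stand-in for the worker process. A real background job backend would use an external broker so that workers can run on other machines. The job payload shape is made up for the example.

```ruby
# Minimal sketch using Ruby's built-in thread-safe Queue as the task queue.
# A real backend would use an external broker so that workers can run on
# other machines; the job payload shape here is made up.
task_queue = Queue.new
processed = []

worker = Thread.new do
  # pop blocks until a message is available; :stop ends the loop.
  while (job = task_queue.pop) != :stop
    processed << "processed #{job[:image]}"
  end
end

# The application server enqueues and moves on without waiting for the worker.
task_queue << { image: "cat.png" }
task_queue << { image: "dog.png" }

task_queue << :stop
worker.join
# processed => ["processed cat.png", "processed dog.png"]
```

The enqueueing side never waits on the worker. It only waits on the queue, which is what keeps the application server free to accept the next request.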

Features

Shopify queues tens of thousands of jobs per second in order to leverage a variety of features.

Response times

Using background jobs allows us to decouple the external-facing request (served by the application server) from any time-consuming backend tasks (executed by the worker), thus improving response times. What improves response times for individual requests also improves the overall availability of the system.

Spikeability

A sudden spike in, say, image uploads, doesn’t hurt if the time consuming image processing is done by a background job. The availability and response time of the application server is constrained by the speed with which it can queue image processing jobs. But the speed of queueing more jobs is not constrained by the speed of processing them. If the worker can’t keep up with the increased amount of image processing tasks, the queue grows. But the queue serves as a buffer between the worker and the application server so that users can continue uploading images as usual. With Shopify facing traffic spikes of up to 170k requests per second, background jobs are essential for maintaining high availability despite unpredictable traffic.

Retries and Redundancy

When a worker encounters an error while running the job, the job is requeued and retried later. Since all of that is happening in the background, it’s not affecting the availability or response times of the external-facing application server. This makes background jobs a great fit for error-prone tasks like requests to an unreliable third party.
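The requeue-and-retry loop can be sketched as follows. This is a simplified, hypothetical illustration. MAX_ATTEMPTS, run_with_retries, and the job hash layout are made-up names, and real backends typically also wait between attempts.

```ruby
# Hypothetical sketch: a failed job goes back on the queue with an attempt
# counter and is given up on after MAX_ATTEMPTS. Names are made up.
MAX_ATTEMPTS = 3

def run_with_retries(queue)
  results = { done: [], dead: [] }
  until queue.empty?
    job = queue.shift
    begin
      job[:task].call
      results[:done] << job[:name]
    rescue StandardError
      job[:attempts] += 1
      if job[:attempts] < MAX_ATTEMPTS
        queue << job # requeue; other jobs run before the retry
      else
        results[:dead] << job[:name]
      end
    end
  end
  results
end

calls = 0
# Simulates an unreliable third party that fails twice, then succeeds.
flaky = -> { (calls += 1) < 3 ? raise("gateway timeout") : :ok }
queue = [
  { name: "webhook", task: flaky, attempts: 0 },
  { name: "email", task: -> { :ok }, attempts: 0 }
]
results = run_with_retries(queue)
# results[:done] => ["email", "webhook"]
```

Note that the healthy email job completes before the flaky webhook job is retried, so one failing task doesn’t hold up the rest of the queue.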

Parallelization

Several workers might process messages from the same queue allowing more than one task to be worked on simultaneously. This is distributing the workload. We can also split a large task into several smaller tasks and queue them as individual background jobs so that several of these subtasks are worked on simultaneously.
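As a small illustration, the Ruby sketch below splits one large task (summing the numbers 1 to 100) into ten subtasks and lets four workers drain the same queue. Threads stand in for worker processes here, and the chunk size and worker count are arbitrary choices for the example.

```ruby
# Sketch: several worker threads drain one shared queue, and a large task is
# split into smaller jobs. The chunk size and worker count are arbitrary.
jobs = Queue.new
(1..100).each_slice(10) { |chunk| jobs << chunk } # 10 subtasks of 10 items
jobs.close # no more jobs; pop returns nil once the queue is drained

partial_sums = Queue.new
workers = Array.new(4) do
  Thread.new do
    while (chunk = jobs.pop)
      partial_sums << chunk.sum
    end
  end
end
workers.each(&:join)

total = 0
total += partial_sums.pop until partial_sums.empty?
# total => 5050
```

Which worker picks up which chunk isn’t deterministic, so subtasks queued this way shouldn’t depend on running in a particular order.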

Prioritization

Most background job backends allow for prioritizing jobs. They might use priority queues that don’t follow the first-in, first-out approach so that high-priority jobs end up cutting the line. Or they set up separate queues for jobs of different priorities and configure workers to prioritize jobs from the higher priority queues. No worker needs to be fully dedicated to high-priority jobs, so whenever there’s no high-priority job in the queue, the worker processes lower-priority jobs. This is resource-efficient, reducing the idle time of workers significantly.
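The separate-queues variant can be sketched in a few lines of Ruby. The job names are invented for the example, and plain arrays stand in for the queues.

```ruby
# Sketch of the separate-queues approach: a worker always checks the
# high-priority queue first and falls back to the low-priority one.
def next_job(high, low)
  high.shift || low.shift
end

high = ["charge_payment"]
low = ["rebuild_index", "cleanup_stale_data"]

order = []
while (job = next_job(high, low))
  order << job
  # A high-priority job arriving mid-run cuts the line.
  high << "send_receipt" if job == "rebuild_index"
end
# order => ["charge_payment", "rebuild_index", "send_receipt", "cleanup_stale_data"]
```

The worker is never idle while either queue has work, yet a newly arrived high-priority job waits for at most one low-priority job to finish.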

Event-based and Time-based Scheduling

Background jobs aren’t always queued by an application server. A worker processing a job can also queue another job. While application servers and workers queue jobs based on events like user interaction or mutated data, a scheduler might queue jobs based on time (for example, for a daily backup).

Simplicity of Code

The background job backend encapsulates the asynchronous communication between the client requesting the job and the worker executing the job. Placing this complexity into a separate abstraction layer keeps the concrete job classes simple. A concrete job class only implements the task to be done (for example, sending a welcome email or processing an image). It’s not aware of being run in the future, being run on one of several workers, or being retried after an error.

Challenges

Asynchronous communication poses some challenges that don’t disappear by encapsulating some of its complexity. Background jobs aren’t any different.

Breaking Changes to Job Parameters

The client queuing the job and the worker processing it don’t always run the same software version. One of them might already have been deployed with a newer version. This situation can last for a significant amount of time, especially if practicing canary deployments. Changes to the job parameters can break the application if a job has been queued with a certain set of parameters, but the worker processing the job expects a different set. Breaking changes to the job parameters need to roll out through a sequence of changes that preserve backward compatibility until all legacy jobs from the queue have been processed.
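One way to picture the first step of such a rollout is a job that accepts both the old and the new parameter set. WelcomeEmailJob and its locale parameter are hypothetical, chosen only to illustrate the idea.

```ruby
# Hypothetical job whose parameters are changing from (user_id)
# to (user_id, locale).
class WelcomeEmailJob
  # First step of the rollout: accept both shapes. The default value keeps
  # jobs that were queued by the old version (without a locale) processable.
  def perform(user_id, locale = "en")
    "welcome email to user #{user_id} in #{locale}"
  end
end

job = WelcomeEmailJob.new
legacy_args = [42]       # queued before the deploy
new_args    = [42, "fr"] # queued after the deploy

puts job.perform(*legacy_args)
puts job.perform(*new_args)
```

Only once no legacy jobs remain in the queue would a later change make the new parameter mandatory.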

No Exactly-once Delivery

When a worker completes a job, it reports back that it’s now safe to remove the job from the queue. But what if the worker processing the job remains silent? We can allow other workers to pick up such a job and run it. This ensures that the job runs even if the first worker has crashed. But if the first worker is simply a little slower than expected, our job runs twice. On the other hand, if we don’t allow other workers to pick up the job, the job will not run at all if the first worker did crash. So we have to decide what’s worse: not running the job at all, or running it twice. In other words, we have to choose between at-least-once and at-most-once delivery.

For example, not charging a customer is not ideal, but charging them twice might be worse for some businesses. At-most-once delivery sounds right in this scenario. However, if every charge is carefully tracked and the job checks those states before attempting a charge, running the job a second time doesn’t result in a second charge. The job is idempotent, allowing us to safely choose at-least-once delivery.
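A minimal sketch of such an idempotent charge job follows. ChargeLedger is a hypothetical in-memory stand-in for wherever charge states are tracked, and ChargeJob is an illustrative name as well.

```ruby
# Hypothetical in-memory stand-in for a table that tracks charge state.
class ChargeLedger
  def initialize
    @charged = {}
  end

  def charged?(order_id)
    @charged.key?(order_id)
  end

  def record!(order_id, amount_cents)
    @charged[order_id] = amount_cents
  end

  def total_cents
    @charged.values.sum
  end
end

class ChargeJob
  def initialize(ledger)
    @ledger = ledger
  end

  # Checks the tracked state first, so a second run is a no-op.
  def perform(order_id, amount_cents)
    return :already_charged if @ledger.charged?(order_id)

    @ledger.record!(order_id, amount_cents)
    :charged
  end
end

ledger = ChargeLedger.new
job = ChargeJob.new(ledger)
first_run  = job.perform(1001, 1999)
second_run = job.perform(1001, 1999) # delivered again by the queue
puts [first_run, second_run, ledger.total_cents].inspect
```

In a real system the check and the write would have to happen atomically, for example guarded by a unique constraint, which this sketch leaves out.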

Non-Transactional Queuing

The job queue is often in a separate datastore. Redis is a common choice for the queue, while many web applications store their operational data in MySQL or PostgreSQL. When a transaction for writing operational data is open, queuing the job will not be part of this enclosing transaction - writing the job into Redis isn’t part of a MySQL or PostgreSQL transaction. The job is queued immediately and might end up being processed before the enclosing transaction commits (or even if it rolls back).

When accepting external input from user interaction, it’s common to write some operational data with very minimal processing, and queue a job performing additional steps processing that data. This job may not find the data it needs unless we queue it after committing the transaction writing the operational data. However, the system might crash after committing the transaction and before queuing the job. The job will never run, the additional steps for processing the data won’t be performed, leaving the system in an inconsistent state.

The outbox pattern can be used to create transactionally staged job queues. Instead of queuing the job right away, the job parameters are written into a staging table in the operational data store. This can be part of a database transaction writing operational data. A scheduler can periodically check the staging table, queue the jobs, and update the staging table when the job is queued successfully. Since this update to the staging table might fail even though the job was queued, the job is queued at least once and should be idempotent.
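The staging flow can be sketched in memory as follows. Here staged_jobs stands in for the staging table in the operational database and queue for the job queue (for example, Redis); every name is hypothetical.

```ruby
# In-memory sketch of a transactionally staged job queue (outbox pattern).
StagedJob = Struct.new(:id, :job_class, :args, :queued)

staged_jobs = []
queue = []

# Inside the same database transaction that writes the operational data,
# only a row in the staging table is written; nothing is queued yet.
staged_jobs << StagedJob.new(1, "ProcessImageJob", [101], false)
staged_jobs << StagedJob.new(2, "ProcessImageJob", [102], false)

# The scheduler runs periodically and moves staged jobs into the queue.
def flush_outbox(staged_jobs, queue)
  staged_jobs.reject(&:queued).each do |row|
    queue << { job_class: row.job_class, args: row.args }
    # If the process crashed right here, the row would stay unmarked and the
    # job would be queued again on the next run: at-least-once queuing.
    row.queued = true
  end
end

flush_outbox(staged_jobs, queue)
flush_outbox(staged_jobs, queue) # a second run finds nothing left to queue
puts queue.size
```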

Depending on the volume of jobs, transactionally staged job queues can result in quite some load on the database. And while this approach guarantees the queuing of jobs, it can’t guarantee that they will run successfully.

Local Transactions

A business process might involve database writes from the application server serving a request and workers running several jobs. This creates the problem of local database transactions. Eventual consistency is reached when the last local transaction commits. But if one of the jobs fails to commit its data, the system is again in an inconsistent state. The SAGA pattern can be used to guarantee eventual consistency. In addition to queuing jobs transactionally, the jobs also report back to the staging table when they succeed. A scheduler can check this table and spot inconsistencies. This results in an even higher load on the database than a transactionally staged job queue alone.

Out of Order Delivery

The jobs leave the queue in a predefined order, but they can end up on different workers and it’s unpredictable which one completes faster. And if a job fails and is requeued, it’s processed even later. So if we’re queueing several jobs right away, they might run out of order. The SAGA pattern can ensure jobs are run in the correct order if the staging table is also used to maintain the job order.

A more lightweight alternative can be used if consistency guarantees are not of concern. Once a job has completed its tasks, it can queue another job as a follow-up. This ensures the jobs run in the predefined order. The approach is quick and easy to implement since it doesn’t require a staging table or a scheduler, and it doesn’t generate any extra load on the database. But the resulting system can become hard to debug and maintain since it’s pushing all its complexity down a potentially long chain of jobs queueing other jobs, with little observability into where exactly things might have gone wrong.

Long Running Jobs

A job doesn’t have to be as fast as a user-facing request, but long-running jobs can cause problems. For example, the popular Ruby background job backend Resque prevents workers from being shut down while a job is running. This worker cannot be deployed while the job is running. It is also not very cloud-friendly, since resources are required to be available for a significant amount of time in a row. Another popular Ruby background job backend, Sidekiq, aborts and requeues the job when a shutdown of the worker is initiated. However, the next time the job is running, it starts over from the beginning, so it might be aborted again before completion. If deployments happen faster than the job can finish, the job has no chance to succeed. With the core of Shopify deploying about 40 times a day, this is not an academic discussion but an actual problem we needed to address.

Luckily, many long-running jobs are similar in nature: they iterate over a huge dataset. Shopify has developed and open sourced an extension to Ruby on Rails’s Active Job framework, making this kind of job interruptible and resumable. It sets a checkpoint after each iteration and requeues the job. Next time the job is processed, work continues at the checkpoint, allowing for safe and easy interruption of the job. With interruptible and resumable jobs, workers can be shut down any time, which makes them more cloud-friendly and allows for frequent deployments. Jobs can be throttled or halted for disaster prevention, for example, if there’s a lot of load on the database. Interrupting jobs also allows for safely moving data between database shards.
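The checkpoint idea can be illustrated in plain Ruby. This sketch is not the API of Shopify's Active Job extension; ResumableJob, the cursor argument, and the interrupt callback are hypothetical names.

```ruby
# Pure-Ruby sketch of an interruptible, resumable iteration.
class ResumableJob
  attr_reader :processed

  def initialize(records)
    @records = records
    @processed = []
  end

  # Processes records starting at `cursor`. Returns the cursor to resume from
  # if interrupted, or nil when the whole dataset has been processed.
  def perform(cursor: 0, interrupt: -> { false })
    (cursor...@records.size).each do |index|
      return index if interrupt.call # checkpoint: the next unprocessed index

      @processed << @records[index].upcase
    end
    nil
  end
end

job = ResumableJob.new(%w[a b c d e])

# First run: a shutdown is signalled once two records have been processed.
checkpoint = job.perform(interrupt: -> { job.processed.size >= 2 })
# Second run (after the requeue): continues at the checkpoint, not from the start.
finished = job.perform(cursor: checkpoint)
puts [checkpoint, finished, job.processed].inspect
```

Because each run only needs to survive one iteration, frequent deployments no longer prevent the job from completing.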

Distributed Background Jobs

Background job backends like Resque and Sidekiq in Ruby usually queue a job by placing a serialized object into the queue, an instance of the concrete job class. This implies that both the client queuing the job and the worker processing it need to be able to work with this object and have an implementation of this class. This works great in a monolithic architecture where clients and workers are running the same codebase. But if we would like to, say, extract the image processing into a dedicated image processing microservice, maybe even written in a different language, we need a different approach to communicate.

It is possible to use Sidekiq with separate services, but the workers still need to be written in Ruby and the client has to choose the right Redis queue for a job. So this approach is not easily applied to a large-scale microservices architecture, but it avoids the overhead of adding a message broker like RabbitMQ.

A message-oriented middleware like RabbitMQ places a purely data-based interface between the producer and consumer of messages, such as a JSON payload. The message broker can serve as a distributed background job backend where a client can offload work onto workers running an entirely different codebase.
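To illustrate what a purely data-based interface means, here is a sketch in which producer and consumer share only a JSON payload. The message fields (task, image_id, sizes) are assumptions made for this example, and no real broker is involved.

```ruby
require "json"

# Producer side: the message is plain data, not a serialized Ruby object.
def build_message(image_id, sizes)
  JSON.generate("task" => "resize_image", "image_id" => image_id, "sizes" => sizes)
end

# Consumer side: any language with a JSON parser could implement this.
def handle_message(raw)
  message = JSON.parse(raw)
  message["sizes"].map { |size| "image #{message['image_id']} resized to #{size}" }
end

raw = build_message(7, %w[100x100 800x600])
handled = handle_message(raw)
puts handled
```

The consumer never needs the producer's job class, only an agreement on the shape of the payload.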

Instead of simple point-to-point queues, topics that leverage task queues add powerful routing. In contrast to HTTP, this routing is not limited to 1:1. Beyond delegating specific tasks, messaging can be used for different event messages whenever communication between microservices is needed. With messages removed after processing, there’s no way to replay the stream of messages and no source of truth for a system-wide state.

Event streaming like Kafka has an entirely different approach: events are written into an append-only event log. All consumers share the same log and can read it at any time. The broker itself is stateless; it doesn’t track event consumption. Events are grouped into topics, which provides some publish-subscribe capabilities that can be used for offloading work to different services. These topics aren’t based on queues, and events are not removed. Since the log of events can be replayed, it can serve, for example, as a source of truth for event sourcing. With a stateless broker and append-only writing, throughput is incredibly high, which makes it a great fit for real-time applications and data streaming.
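The difference from a queue can be sketched with a tiny in-memory log. EventLog is a hypothetical class for illustration and says nothing about how Kafka is implemented; the point is only that consumers keep their own offsets and events are never removed.

```ruby
# In-memory sketch of an append-only event log.
class EventLog
  def initialize
    @events = []
  end

  # Appends an event and returns its offset in the log.
  def append(event)
    @events << event
    @events.size - 1
  end

  # Returns all events from `offset` onward without removing them.
  def read_from(offset)
    @events.drop(offset)
  end
end

log = EventLog.new
%w[order_created order_paid order_shipped].each { |event| log.append(event) }

# The consumer, not the log, remembers how far it has read.
billing_offset = 0
billing_events = log.read_from(billing_offset)
billing_offset += billing_events.size

log.append("order_delivered")
new_for_billing = log.read_from(billing_offset) # only the new event
replayed = log.read_from(0)                     # a new consumer replays everything
puts new_for_billing.inspect
puts replayed.size
```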

Background jobs allow the user-facing application server to delegate tasks to workers. With less work on its plate, the application server can serve user-facing requests faster and maintain a higher availability, even when facing unpredictable traffic spikes or dealing with time-consuming and error-prone backend tasks. The background job backend encapsulates the complexity of asynchronous communication between client and worker into a separate abstraction layer, so that the concrete code remains simple and maintainable.

High availability and short response times are necessary for providing a great user experience, making background jobs an indispensable tool regardless of the application’s scale.

Kerstin is a Staff Developer transforming Shopify’s massive Rails code base into a more modular monolith, building on her prior experience with distributed microservices architectures.


Wherever you are, your next journey starts here! If building systems from the ground up to solve real-world problems interests you, our Engineering blog has stories about other challenges we have encountered. Intrigued? Visit our Engineering career page to find out about our open positions and learn about Digital by Default.

Understanding GraphQL for Beginners–Part Two

Welcome back to part two of the Understanding GraphQL for Beginners series. In this tutorial, we’ll build GraphQL fields about food! If you did not read part one of this series, please read it before reading this part.

As a refresher, GraphQL is a data manipulation and query language for APIs. The two main benefits of implementing GraphQL are:

  1. The ability to describe the structure you want back as your response.
  2. Only needing one endpoint to consume one or more resources.

Learning Outcomes

  • Examine the file directory of GraphQL.

  • Identify the difference between root fields and object fields.

  • Create a GraphQL object based on an existing Ruby on Rails model.

  • Create a GraphQL root field to define the structure of your response.

  • Use a GraphQL root field to query data within a database.

  • Examine how the GraphQL endpoint works.

Before You Start

Download the repository to follow along in this tutorial. The repository has been set up with models and gems needed for GraphQL. Once downloaded, seed the database.

The repository contains the following models:

Food

Attribute        Type
id               Bigint
name             String
place_of_origin  String
image            String
created_at       Timestamp
updated_at       Timestamp

Nutrition

Attribute           Type
id                  Bigint
food_id             Bigint
serving_size        String
calories            String
total_fat           String
trans_fat           String
saturated_fat       String
cholesterol         String
sodium              String
potassium           String
total_carbohydrate  String
dietary_fiber       String
sugars              String
protein             String
vitamin_a           String
vitamin_c           String
calcium             String
iron                String
created_at          Timestamp
updated_at          Timestamp

GraphQL File Structure

Everything GraphQL related is found in the folder called “graphql” under “/app”. Open up your IDE, and look at the file structure under “graphql”.

A screenshot of the directory structure of the folder graphql in an IDE. Under the top folder graphql are mutations and types, surrounded by a yellow box. Underneath them is food_app_schema.rb.
Directory structure of the folder graphql

In the yellow highlighted box, there are two directories here:

  1. “Mutations”
    This folder contains classes that will modify (create, update or delete) data.
  2. “Types”
    This folder contains classes that define what will be returned, as well as the types of queries (mutation_type.rb and query_type.rb) that can be called.

In the red highlighted box, there’s one important file to note.

A screenshot of the directory structure of the folder graphql in an IDE. Under the top folder graphql are mutations and types. Underneath them is food_app_schema.rb, which is surrounded by a red box.
Directory structure of the folder graphql

The file food_app_schema.rb defines the queries you can make.

Creating Your First GraphQL Query all_food

We’re creating a query that returns a list of all the food. To do that, we need to create fields. There are two kinds of fields:

  1. Root fields define the structure of your response fields based on the object fields selected. They’re the entry points (similar to endpoints) to your GraphQL server.
  2. Object fields are attributes of an object.

We create a new GraphQL object called food. On your terminal run rails g graphql:object food. This will create the file food_type.rb filled with all the attributes found in your foods table in db/schema.rb. Your generated food_type.rb will look like:

This class contains all the object fields, each exposing a specific piece of an object’s data. Next, we need to create the root field that allows us to ask the GraphQL server for what we want. Open query_type.rb, the class that contains all root fields.

Remove the field test_field and its method. Create a field called all_food like below. As food is both a singular and plural term, we use all_food for the plural.

The format for a field is as follows:

  1. The field name (:all_food).
  2. The return type for the field ([Types::FoodType]).
  3. Whether the field can be null (null: false). Setting this to false means that the field will never be null.
  4. The description of the field (description: "Get all the food items.").

Congratulations, you’ve created your first GraphQL query! Let’s go test it out!

How to Write and Run Your GraphQL Query

To test your newly created query, we use the GraphiQL playground to execute the all_food query. To access GraphiQL, open the following address in your browser: localhost:3000/graphiql.

You will see this page:

 A screenshot of the GraphiQL playground.  There are two large text boxes side by side. The left text box is editable and the right isn't. The menu item at the top shows the GraphiQL name, a play button, Prettify button, and History button.
GraphiQL playground

The left side of the page is where we will write our query. The right side will return the response to that query.

Near the top left corner, next to the GraphiQL text, are three buttons:

A screenshot of the navigation menu of GraphiQL. The menu shows a play button, a Prettify button, and a History button.
GraphiQL playground menus
  1. This button will execute your query.
  2. This button will reformat your query to look pretty.
  3. This button will show all previous queries you ran.
  4. On the right corner of the menu bar is a button called “< Docs”.
A screenshot of the <Docs menu item. There is a large red arrow pointing to the menu item and it says click here.
Docs menu item

If you click on the “< Docs” button in the top right corner, you can find all the possible queries based on your schema.

A screenshot of the <Docs menu item. The page is called "Document Explorer" and displays a search field allowing the user to search for a schema.  Underneath the search the screen lists two Root Types in a list: "query:Query" and "mutation: Mutation." There is a red box around "query:Query" and the words "Click here." to its right.
Document explorer

The queries are split into two groups, query and mutation. The “query” group contains all queries that don’t modify data. Queries that do modify data can be found in “mutation”. Click on “query: Query” to find the “all_food” query we just created.

A screenshot of the Query screen with a result displaying the field allFood: [Food!]!
Query screen

After clicking on “query: Query”, you will find all the possible queries you can make. If you click on [Food!]!, you will see all the possible fields we can ask for.

A screenshot listing the all fields contained in the all_food query.
Fields in the all_food query

These are all the possible fields you can use within your all_food query. Remember, GraphQL allows us to describe exactly what we need. Let’s say we only want the ids and names of all food items. We write the query as

Click the execute button to run the query. You get the following response back:

Awesome job! Now, create another query to get the image and place_of_origin fields back:

You will get this response back.

What’s Happening Behind the Scenes?

Recall from part one, GraphQL has this single “smart” endpoint that bundles all different types of RESTful actions under one endpoint. This is what happens when you make a request and get a response back.

A flow diagram showing the steps to execute a query between the client and the food app's server.

When you execute the query:

  1. You call the graphql endpoint with your request (for example, query and variables).
  2. The graphql endpoint then calls the execute method from the graphql_controller to process your request.
  3. The method renders a JSON containing a response catered to your request.
  4. You get a response back.
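To make step 1 more concrete, here is a sketch of the request body a client sends to the endpoint. The allFood field name matches the all_food root field from this tutorial; the graphql_request_body helper is a hypothetical illustration, and the "query" and "variables" keys follow the common convention for GraphQL over HTTP.

```ruby
require "json"

# The query a client wants to run, written as a plain string.
ALL_FOOD_QUERY = <<~GRAPHQL
  query {
    allFood {
      id
      name
    }
  }
GRAPHQL

# Builds the JSON body a client would send to the graphql endpoint.
def graphql_request_body(query, variables = {})
  JSON.generate("query" => query, "variables" => variables)
end

body = graphql_request_body(ALL_FOOD_QUERY)
puts body
```

GraphiQL builds and sends a body like this for you every time you click the execute button.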

Try it Yourself #1

Try to implement the root field called nutrition. Like all_food, it returns all nutrition facts.

If you need any assistance, please refer to this gist that includes a sample query and response: https://gist.github.com/ShopifyEng/7c196bf443bdf26e55f827d65ee490a6

Adding a Field to an Existing Query

You may have noticed that the nutrition table contains a foreign key, where a food item has one nutrition fact. Currently, it’s associated at the model level but not used at the GraphQL level. For someone to query food and get the nutrition fact as well, we need to add a nutrition field to food.

Add the following field to food_type.rb:

field :nutrition, Types::NutritionType, null: true

Let’s execute the following query where we want to know the serving size and calories of each food item:

You will get this response back:

Hooray! We now know the serving size and calories of each food item!

So far, we learned how to create root fields to query all data of a specific resource. Let’s write a query to look at data based on id.

Writing a Query with an Argument

In query_type.rb, we need to add another root field called food that takes a required argument called id:

On GraphiQL, let’s execute this query:

You will get this response back:

Try it Yourself #2

This time, create a root field called find_food, which returns a set of data based on place_of_origin.

If you need any assistance, please refer to this gist that includes a sample query and response: https://gist.github.com/ShopifyEng/1f92cee91f2932a0ef665594418764d3

As we’ve reached the end of this tutorial, let’s recap what we learned!

  1. The GraphQL generator creates and populates an object if a model with the same name exists.
  2. Root fields define the structure of your response and are entry points to your GraphQL server.
  3. Object fields are an object’s attributes.
  4. All requests are processed by the graphql_controller’s execute method and return a JSON response back.

I hope you enjoyed creating some GraphQL queries! One thing you might still be wondering is how do we update these ActiveRecord objects? In part 3 of Understanding GraphQL for Beginners, we’ll continue creating queries called mutations that create, update, or delete data.

If you would like to see the finished code for part two, check out the branch called part-2-solution.

Often mistaken as an intern, Raymond Chung is building a new generation of software developers through Shopify's Dev Degree program. As a Technical Educator, his passion for computer science and education allows him to create bite-sized content that engages interns throughout their day-to-day. When he is not teaching, you'll find Raymond exploring for the best bubble tea shop.


We're planning to DOUBLE our engineering team in 2021 by hiring 2,021 new technical roles (see what we did there?). Our platform handled record-breaking sales over BFCM and commerce isn't slowing down. Help us scale & make commerce better for everyone.

Understanding GraphQL for Beginners–Part One

As developers, we’re always passionate about learning new things! Whether it’s a new framework or a new language, our curiosity takes us places! One term you may have heard of is REST. REST stands for REpresentational State Transfer - a software architecture style introduced by Roy Fielding in the year 2000, with a set of principles on how a web application should behave. Think of this as a guideline of operations, like how to put together a meal. One of the principles is that one endpoint should only do one CRUD action (either create, read, update, or delete). As well, each RESTful endpoint returns a fixed set of data. I like to think of this as a cookie-cutter response, where you get the same shape back every time. Sometimes you need less data, and other times you need more. This can lead to the issue of calling additional APIs to get more data. How can we get exactly the right amount of data in one call?

As technology evolves, one alternative that contrasts with REST and is gaining popularity is GraphQL. But what is it exactly? In this post, we will learn what GraphQL is all about!

Learning Outcomes

  • Explain what GraphQL is.
  • Use an analogy to deepen your understanding of GraphQL.
  • Explain the difference between REST and GraphQL.

Before You Start

If you are new to API development, here are some terms for reference. Otherwise, continue ahead.

API

What is an API?

An Application Programming Interface (API) allows two machines to talk to each other. You can think of it as the cashier who takes your request to the kitchen to prepare and gives you your meal when ready.

Why are APIs important?

APIs allow multiple devices like your laptop and phone to talk to the same backend server. APIs that use REST are RESTful.

REST

What is REST?

REpresentational State Transfer (REST) is a software architecture style describing how a web application should behave. Think of this as a guideline of operations, like how to put a meal together.

Why is REST important?

REST offers a great deal of flexibility like handling different types of calls and responses. It breaks down a resource into CRUD services, making it easier to organize what each endpoint should be doing. One of REST’s key principles is client-server separation of concerns. This means that any issues that happen on the server are the concern of the server alone. All the client cares about is getting a response back based on their request to the server.

Latency Time

What is latency time?

Latency time is the time it takes for your request to travel to the server to be processed. You can think of this like driving from point A to B. Sometimes there are delays due to traffic congestion.

Why is latency time important?

The lower the latency, the faster the request can be processed by the server. The higher the latency, the longer it takes for your request to be processed.

Response Time

What is response time?

Response time is the sum of latency time and the time it takes to process your request. Think of this as the time from ordering your meal to receiving it.

Why is response time important?

Like latency, the faster the response time, the more seamless the overall experience feels for users. The slower the response time, the less seamless it feels for users, and they may quit your application.

What Is GraphQL?

GraphQL is an open-source data query and manipulation language for APIs, released publicly by Facebook in 2015.

Unlike REST, GraphQL offers the flexibility for clients to describe the structure of the data they need in the form of a query. No more and no less. The best part is it's all under one endpoint! The response back will be exactly what you described, and not a cookie-cutter response.

For example, provided below, we have three API responses about the Toronto Eagles, their championships, and their players. If we want to look at the year the Toronto Eagles were founded, the first and last name of the team captain and their last championship, we need to make three separate RESTful calls.

Call 1:

Call 2:

Call 3:

When you make an API call, it’s ideal to get a response back within a second. The response time is made up of latency time and processing time. With three API calls, we are making three trips to the server and back. You may expect that the latency times for all three calls would be the same. That’s rarely the case. You can think of latency like driving in traffic: sometimes it's fast, and sometimes it's slow due to rush hour. If one of the calls is slow, the total response time is slow!
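As a small worked example of how the times add up, the sketch below uses made-up timings in milliseconds and assumes the three calls are made one after another. None of these numbers are measurements.

```ruby
# Response time is latency time plus processing time.
def response_time(latency_ms, processing_ms)
  latency_ms + processing_ms
end

# Hypothetical timings; real values vary from call to call.
rest_calls = [
  { latency_ms: 80,  processing_ms: 20 }, # call 1
  { latency_ms: 300, processing_ms: 25 }, # call 2 hit "rush hour"
  { latency_ms: 90,  processing_ms: 30 }  # call 3
]

# Made one after another, the three response times add up.
rest_total = rest_calls.sum { |call| response_time(call[:latency_ms], call[:processing_ms]) }
puts rest_total
```

In this example the single slow call accounts for more than half of the total.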

Luckily with GraphQL, we can combine the three requests together, and get the exact amount of data back on a single trip!

GraphQL query:

GraphQL response:

GraphQL Analogies

Here are two analogies to help describe how GraphQL compares to REST.

Analogy 1: Burgers

Imagine you are a customer at a popular burger restaurant, and you order their double cheeseburger. Regardless of how many times you order (calling your RESTful API), you get every ingredient in that double cheeseburger every time. It will always be the same shape and size (what’s returned in a RESTful response).

An image of a two pattie hamburger on a sesame seed bun with cheese, bacon, pickles, red pepper, and secret sauce
Photo by amirali mirhashemian on Unsplash.

With GraphQL, you can “have it your way” by describing exactly how you want that double cheeseburger to be. I’ll take a double cheeseburger with fewer pickles, cheese not melted, bacon on top, sautéed onions on the bottom, and finally no sesame seeds on the bottom bun.

Your GraphQL response is shaped and sized to be exactly how you describe it.

A two pattie hamburger on a sesame seed bun with cheese, bacon, pickles, red pepper, and secret sauce
Photo by amirali mirhashemian on Unsplash.

Analogy 2: Banks

You are going to the bank to make a withdrawal of $200. Using the RESTful way, you won’t be able to describe how you want your money to be. The teller (response) will always give you two $100 bills.

RESTful response:

An image of two rectangles side by side. Each rectangle represents $100 and that text is contained within each rectangle.
Two $100 bills

By using GraphQL, you can describe exactly how you want your denominations to be. You can request one $100 bill and five $20 bills.

GraphQL response:

An image of six rectangles in a three by three grid. The first rectangle starting from the top left represents $100 and the other five represent $20 from the text contained within each rectangle.
One $100 bill and five $20 bills

REST Vs GraphQL

Compared to RESTful APIs, GraphQL provides more flexibility on how to ask for data from the server. It provides four main benefits over REST:

  1. No more over fetching extra data.
    With REST APIs, a fixed set of data (same size and shape response) is returned. Sometimes, a client doesn’t need all the data. GraphQL solves this by having the clients grab only what they need.
  2. No more under fetching data.
    Sometimes, a client may need more data. With REST, additional calls must be made to get data that an endpoint may not have. With GraphQL, the client can ask for everything it needs in a single query.
  3. Rapid product iterations on the front end.
    The flexible structure is catered to clients. Frontend developers can make UI changes without asking the backend developers to change the API to accommodate them.
  4. Fewer endpoints.
    Calling too many endpoints can get confusing really fast. GraphQL’s single “smart” endpoint bundles all different types of RESTful actions under one.
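The first benefit can be sketched in a few lines of Ruby. This is not how a GraphQL server is implemented; the TEAM record, its values, and both helper methods are hypothetical and only contrast a fixed-shape response with one shaped by the client.

```ruby
# Hypothetical record with made-up values.
TEAM = {
  id: 1,
  name: "Toronto Eagles",
  city: "Toronto",
  founded: 1990,
  arena: "Example Arena"
}.freeze

# REST-style: the endpoint decides the shape, so the client always gets everything.
def rest_response(record)
  record
end

# GraphQL-style: the client lists the fields it wants and gets only those.
def graphql_style_response(record, fields)
  record.slice(*fields)
end

full    = rest_response(TEAM)
trimmed = graphql_style_response(TEAM, %i[name founded])
puts full.keys.inspect
puts trimmed.inspect
```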

By leveraging GraphQL’s principle of describing the structure of the data you want back, you don’t need to make multiple trips for some cookie-cutter responses. Read part two of Understanding GraphQL for Beginners, where we’ll implement GraphQL in a Ruby on Rails application, and create and execute queries!

Often mistaken as an intern, Raymond Chung is building a new generation of software developers through Shopify's Dev Degree program. As a Technical Educator, his passion for computer science and education allows him to create bite-sized content that engages interns throughout their day-to-day. When he is not teaching, you'll find Raymond exploring for the best bubble tea shop.


Let’s Encrypt x Shopify: Securing the Web 4.5 Million Domains at a Time

On June 30, 2021 Shipit!, our monthly event series, presented Let’s Encrypt and Shopify: Securing Shopify’s 4.5 Million Domains. Learn about how we secure over 4.5M Shopify domains and team up to foster a safer Internet for everyone. The video is now available.

It’s already been six years since Shopify became a sponsor of Let’s Encrypt.

In 2016, the SSL team started transitioning all of our merchants' stores to HTTPS. When we started exploring the concept a few years earlier, it was a daunting task. There were few providers that could let us integrate a certificate authority programmatically. The few that did had names like “Reseller API.” The idea that you would give away certificates for free and no human would be involved was completely alien in this market. Everything was designed with the idea that a user would be purchasing the certificate, downloading it, and installing it somehow. It’s a lot more problematic than you might think. For example, a lot of those API return human readable error messages instead of having a defined error code. Normally, they would expect the implentor to send back the message to the user trying to purchase a certificate, but in a fully automated system there is no user to read anything. For Shopify, all 650,000 domains would get a certificate, and they would be provisioned and renewed without any interactions from our merchants.

I first heard about Let’s Encrypt in 2014. A lot of the chatter online was around the fact that they would become a certificate authority providing free certificates (certificates were pretty expensive until then), but a bit less about the other part of the project, the ACME protocol. The idea was to fully automate certificate authorities using standardized APIs.

In the summer of 2015 they still hadn’t launched, but I started to write a Ruby implementation of the ACME client protocol on the weekend to get a feel for it. I’d already been through this exercise a few times with other providers. Working from a specification was pretty refreshing. They’re boring documents, but when trying to automate hundreds of thousands of domains that you don’t really control, you want to know that you have all your exceptions accounted for. That’s when we reached out to them to figure out how Shopify could help and agreed on a sponsorship. We didn’t intend to make use of their service, at least not in the immediate future, but we share values around the open web and the importance of removing barriers to entry using technology.

Interacting with a small organization that does its work fully in the open was also quite refreshing. My experience with certificate authorities had been working with an account manager who forwarded my questions to a technical team. The software they run is usually not implemented by them, so there is a limit to how much they can answer. Let’s Encrypt being fully open changes the dynamic. I asked questions on IRC and they answered me with GitHub links that pointed at the actual implementation. I reported bugs or inconsistencies in the specification, and they tagged me in the pull requests that fixed them.

In late November, we started rolling out our shiny new automated provisioning system. We immediately ran into some scalability issues with our initial providers. We did some napkin math: with the throttling they were imposing on us, we would need about 100 days to provision every domain. We let it run over the holidays and launched in February 2016.

The team was already engaged in its next mission, but in the back of our minds we knew we needed to revisit this. Now that the bulk of the domains were done, new domains and eventually renewals would come at a slower pace, and that would be good enough for a while at our current growth projection. Our main concern was emergency rotation. If for some reason we had to rotate our private keys or the certificate chain was compromised somehow, we’d be in trouble. A hundred days is too slow to react to an incident.

We needed to be more responsive for our merchants, and that’s why we decided to add Let’s Encrypt as a backup option. We were able to roll Let’s Encrypt out in a few hours compared to months with our original providers. The errors we ran into were predictable because their specification and server implementation are open source, so we could refer directly to them to debug unexpected behaviour. It was so reliable that we decided to make them our main certificate authority.

Let's Encrypt is a game changer for the industry. For a big software-as-a-service company like Shopify, it saves time because their implementation is built around an open specification. You can even change to or add another certificate authority that supports the ACME protocol without redesigning your entire infrastructure. It's more reliable than the APIs of the past because it was designed to be fully automated from the beginning.

Shipit! Presents Let’s Encrypt and Shopify: Securing Shopify’s 4.5 Million Domains

Shipit! welcomes Josh Aas, co-founder and Executive Director of Let’s Encrypt and Shopify’s Charles Barbier, Application Security Development Manager, to talk about securing over 4.5 million Shopify domains and teaming up to foster a safer Internet for everyone.

Additional Information

Charles Barbier is a Developer Lead for the Application Security team. You can connect with him on Twitter.


Rate Limiting GraphQL APIs by Calculating Query Complexity

Rate limiting is a system that protects the stability of APIs. GraphQL opens new possibilities for rate limiting. I’ll show you Shopify’s rate limiting system for the GraphQL Admin API and how it addresses some limitations of the methods commonly used in REST APIs. I’ll show you how we calculate query costs that adapt to the data clients need while providing a more predictable load on servers.

What Is Rate Limiting and Why Do APIs Need It?

To ensure developers have a reliable and stable API, servers need to enforce reasonable API usage. The most common cases that can affect platform performance are:

  • Bad actors abusing the API by sending too many requests.
  • Clients unintentionally sending requests in infinite loops or sending a high number of requests in bursts.

The traditional way of rate limiting APIs is request-based and widely used in REST APIs. Some of them have a fixed rate (that is, clients are allowed to make a set number of requests per second). The Shopify Admin REST API provides credits that clients spend every time they make a request, and those credits are refilled every second. This allows clients to keep a steady request pace that never hits the limit (for example, two requests per second) and to make occasional request bursts when needed (for example, 10 requests per second).
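
The credit model can be sketched in a few lines of Python. This is only an illustration, not Shopify's implementation: `CreditBucket` is a hypothetical name, and the capacity of 10 credits and refill of two credits per second are assumptions borrowed from the examples in the paragraph above.

```python
class CreditBucket:
    """Request-based limiter: each request spends one credit, credits refill over time."""

    def __init__(self, capacity=10, refill_per_second=2.0):
        self.capacity = capacity                    # assumed burst size
        self.refill_per_second = refill_per_second  # assumed sustained pace
        self.credits = float(capacity)
        self.last_seen = 0.0

    def allow(self, now):
        """Return True if a request arriving at time `now` (in seconds) may proceed."""
        elapsed = now - self.last_seen
        self.last_seen = now
        # Refill for the elapsed time, never above capacity.
        self.credits = min(self.capacity, self.credits + elapsed * self.refill_per_second)
        if self.credits >= 1:
            self.credits -= 1
            return True
        return False


bucket = CreditBucket()
burst = [bucket.allow(now=0.0) for _ in range(12)]  # 12 requests at the same instant
after_one_second = bucket.allow(now=1.0)            # two credits have been refilled
```

In this sketch the first 10 requests of the burst pass, the last two are rejected, and a request one second later is accepted again. Note that every request costs exactly one credit, whatever it asks for.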

Despite being widely used, the request-based model has two limitations:

  • Clients spend the same number of credits per request, even if they don’t need all the data in an API response.
  • POST, PUT, PATCH, and DELETE requests produce side effects that put more load on servers than GET requests, which only read existing data. Despite the difference in resource usage, all these requests consume the same number of credits in the request-based model.

The good news is that we leveraged GraphQL to overcome these limitations and designed a rate limiting model that better reflects the load each request causes on a server.

The Calculated Query Cost Method for GraphQL Admin API Rate Limiting

In the calculated query cost method, clients receive 50 points per second up to a limit of 1,000 points. The main difference from the request-based model is that every GraphQL request has a different cost.
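
To get a feel for these numbers, here is a small sketch of how long a client would wait before it can afford a query. The function name and parameters are hypothetical. Only the 50 points per second and the 1,000 point limit come from the text.

```python
def seconds_until_affordable(query_cost, currently_available, restore_rate=50, maximum=1000):
    """Seconds a client must wait before it has enough points for `query_cost`."""
    if query_cost > maximum:
        raise ValueError("query cost exceeds the bucket size and can never be afforded")
    shortfall = query_cost - currently_available
    # No wait when the client already holds enough points.
    return max(0.0, shortfall / restore_rate)


wait_when_full = seconds_until_affordable(query_cost=7, currently_available=1000)
wait_when_low = seconds_until_affordable(query_cost=300, currently_available=50)
```

With a full bucket a seven-point query runs immediately, while a 300-point query with only 50 points left has to wait for 250 points to be restored.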

Let’s get started with our approach to challenges faced by the request-based model. 

Defining the Query Cost for Types Based on the Amount of Data it Requests

The server performs static analysis on the GraphQL query before executing it. By identifying each type used in a query, we can calculate its cost.

Objects: One Point

The object is our base unit and worth one point. Objects usually represent a single server-side operation such as a database query or a request to an internal service.

Scalars and Enums: Zero Points

You might be wondering, why do scalars and enums have no cost? Scalars are types that return a final value. Some examples of scalar types are strings, integers, IDs, and booleans. An enum is a special kind of scalar that returns one of a predefined set of values. These types live within objects that already have their cost calculated. Querying additional scalars and enums within an object generally comes at a minimal cost.

In this example, shop is an object, costing 1. id, name, timezoneOffsetMinutes, and customerAccounts are scalar types that cost 0. The total query cost is 1.

Connections: Two Points Plus the Number of Returned Objects

Connections express one-to-many relationships in GraphQL. Shopify uses Relay-compliant connections, meaning they follow some conventions, such as being composed of edges, node, cursor, and pageInfo.

The edges object contains the fields describing the one-to-many relationship:

  • node: the list of objects returned by the query.
  • cursor: our current position on that list.

pageInfo holds the hasPreviousPage and hasNextPage boolean fields that help you navigate through the list.

The cost for connections is two plus the number of objects the query expects to return. In this example, a connection that expects to return five objects has a cost of seven points:

cursor and pageInfo come free of charge as they’re the result of the heavy lifting already made by the connection object.

This query costs seven points just like the previous example:

Interfaces and Unions: One Point

Interfaces and unions behave as objects that return different types, therefore they cost one point just like objects do.

Mutations: 10 Points

Mutations are requests that produce side effects on databases and indexes, and can even trigger webhooks and email notifications. A higher cost is necessary to account for this increased server load so they’re 10 points. 
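
The rules in this section can be put together in a short Python sketch. Everything about the representation is an assumption made for illustration: a query is modelled as a nested dict instead of parsed GraphQL, a connection is marked by a `first` key, and fields inside a connection are not charged separately so that the result matches the seven-point examples above. It isn't the server's actual static analysis.

```python
def selection_cost(selection):
    """Cost of a selection given as {field_name: subselection}.

    A subselection is None for a scalar or enum, a dict for an object
    (or interface/union), and a dict with a "first" key for a connection.
    """
    total = 0
    for sub in selection.values():
        if sub is None:
            continue                          # scalars and enums: zero points
        if "first" in sub:
            total += 2 + sub["first"]         # connection: two plus expected objects
        else:
            total += 1 + selection_cost(sub)  # object: one point plus nested fields
    return total


def request_cost(operation, selection):
    """Mutations are charged a flat 10 points in this sketch."""
    return 10 if operation == "mutation" else selection_cost(selection)


shop_query = {"shop": {"id": None, "name": None,
                       "timezoneOffsetMinutes": None, "customerAccounts": None}}
products_query = {"products": {"first": 5}}

shop_cost = request_cost("query", shop_query)
products_cost = request_cost("query", products_query)
mutation_cost = request_cost("mutation", {})
```

The shop query costs one point, the five-product connection costs seven, and any mutation costs 10.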

Getting Query Cost Information in GraphQL Responses

You don’t need to calculate query costs yourself. API responses include an extension object that reports the query cost. You can try running a query in the Shopify Admin API GraphiQL explorer and see its calculated cost in action.

The request:

The response with the calculated cost displayed by the extension object:

Getting Detailed Query Cost Information in GraphQL Responses

You can get detailed per-field query costs in the extension object by adding the X-GraphQL-Cost-Include-Fields: true header to your request:

Understanding Requested Vs Actual Query Cost

Did you notice two different types of costs on the queries above?

  • The requested query cost is calculated before executing the query using static analysis.
  • The actual query cost is calculated while we execute the query.

Sometimes the actual cost is smaller than the requested cost. This usually happens when you query for a specific number of records in a connection, but fewer are returned. The good news is that any difference between the requested and actual cost is refunded to the API client.

In this example, we query the first five products with a low inventory. Only one product matches this query, so even though the requested cost is seven, you are only charged for the four points calculated by the actual cost:
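
As a sketch of how a client might read these numbers, the snippet below computes the refund from a response body. The field names `extensions`, `cost`, `requestedQueryCost`, and `actualQueryCost` are assumptions about the response shape and should be checked against a real response. The values seven and four are the ones from the example above.

```python
# Assumed shape of the cost information in a response body.
response = {
    "data": {},
    "extensions": {
        "cost": {
            "requestedQueryCost": 7,  # from static analysis, before execution
            "actualQueryCost": 4,     # measured while executing the query
        }
    },
}


def refunded_points(response):
    """Points handed back to the client after the query has run."""
    cost = response["extensions"]["cost"]
    return cost["requestedQueryCost"] - cost["actualQueryCost"]


refund = refunded_points(response)
```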

Measuring the Effectiveness of the Calculated Query Cost Model

The calculated query complexity and execution time have a linear correlation

By using the query complexity calculation rules, we have a query cost that’s proportional to the server load measured by query execution time. This gives Shopify the predictability needed to scale our infrastructure, giving partners a stable platform for building apps. We can also detect outliers on this correlation and find opportunities for performance optimization.

Rate limiting GraphQL APIs by calculating the amount of data clients query or modify adapts better to the use case of each API client than the request-based model commonly used by REST APIs. Our calculated query cost method benefits clients with good API usage because it encourages them to request only the data they need, providing servers with a more predictable load.

Additional Information

Guilherme Vieira is a software developer on the API Patterns team. He loves building tools to help Partners and Shopifolk turn their ideas into products. He grew up a few blocks from a Formula 1 circuit and has been a fan of this sport ever since.


10 Lessons Learned From Online Experiments

Controlled experiments are considered the gold standard approach for determining the true effect of a change. Despite that, there are still many traps, biases, and nuances that can spring up as you experiment, and lead you astray. Last week, I wrapped up my tenth online experiment, so here are 10 hard-earned lessons from the past year.

1. Think Carefully When Choosing Your Randomized Units

After launching our multi-currency functionality, our team wanted to measure the impact of our feature on the user experience, so we decided to run an experiment. I quickly decided that we would randomly assign each session to test or control, and measure the impact on conversion (purchase) rates. We had a nicely modelled dataset at the session grain that was widely used for analyzing conversion rates, and consequently, the decision to randomize by session seemed obvious.

Online A/B test with session level randomization

When you map out this setup, it looks something like the image below. Each session comes from some user, and we randomly assign their session to an experience. Our sessions are evenly split 50/50.

User experience with session level randomization

However, if you look a little closer, this setup could result in some users getting exposed to both experiences. It’s not uncommon for a user to have multiple sessions on a single store before they make a purchase.

If we look at our user Terry below, it’s not unrealistic to think that the “Version B” experience they had in Session 2 influenced their decision to eventually make a purchase in Session 3, which would get attributed to the “Version A” experience.

Carryover effects between sessions may violate the independent randomization units assumption

This led me to my first valuable lesson: think carefully when choosing your randomization unit. Randomization units should be independent, and if they aren’t, you may not be measuring the true effect of your change. Another factor in choosing your randomization unit comes from the desired user experience. You can imagine that it would be confusing for users if something significant were visibly different each time they came to a website. For a more involved discussion on choosing your randomization unit, check out my post: Choosing your randomization unit in online A/B tests.
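
One common way to keep a user in a single experience is to derive the group from a stable user identifier instead of the session. The sketch below is an illustration under assumptions: `assign_group`, the experiment name, and the hashing scheme are hypothetical and not the assignment logic used in the experiment described here.

```python
import hashlib


def assign_group(unit_id, experiment="multi_currency"):
    """Deterministically map a randomization unit to 'control' or 'treatment'."""
    digest = hashlib.sha256(f"{experiment}:{unit_id}".encode()).hexdigest()
    return "treatment" if int(digest, 16) % 2 else "control"


# Randomizing by user: every session of the same user lands in the same group.
sessions = [("terry", "session-1"), ("terry", "session-2"), ("terry", "session-3")]
groups_by_user = {assign_group(user) for user, _ in sessions}

# Randomizing by session: the same user can end up in both groups.
groups_by_session = {assign_group(session) for _, session in sessions}
```

Because the hash input is the user, `groups_by_user` always contains exactly one group, which removes the carryover between sessions described above.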

2. Simulations Are Your Friend

With the above tip in mind, we decided to switch to user level randomization for the experiment, while keeping the session conversion rate as our primary success metric since it was already modelled and there was a high degree of familiarity with the metric internally.

However, after doing some reading I discovered that having a randomization unit (user) that’s different from your analysis unit (session) could lead to issues. In particular, there were some articles claiming that this could result in a higher rate of false positives. One of them showed a plot like this:

Distribution of p-values for session level conversion with users as a randomization unit

The intuition is that your results could be highly influenced by which users land in which group. If you have some users with a lot of sessions, and a really high or low session conversion rate, that could heavily influence the overall session conversion rate for that group.

Rather than throwing my hands up and changing the strategy, I decided to run a simulation to see if we would actually be impacted by this. The idea behind the simulation was to take our population and simulate many experiments where we randomized by user and compared session conversion rates, just like we were planning to do in our real experiment. I then checked if we saw a higher rate of false positives. It turned out we didn’t, so we decided to stick with our existing plan.

The key lesson here was that simulations are your friend. If you’re ever unsure about some statistical effect, it’s very quick (and fun) to run a simulation to see how you’d be affected, before jumping to any conclusions.
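
A simulation along these lines might look like the sketch below. It is not the code that was run for the experiment: the population is synthetic (session counts and conversion propensities are made-up assumptions), and the test is a plain pooled two-proportion z-test. Since there is no real effect in any simulated experiment, every significant result is a false positive.

```python
import math
import random


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value of a pooled two-proportion z-test."""
    if n_a == 0 or n_b == 0:
        return 1.0
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (conv_a / n_a - conv_b / n_b) / se
    return math.erfc(abs(z) / math.sqrt(2))


def simulate_false_positive_rate(n_users=400, n_experiments=300, alpha=0.05, seed=7):
    rng = random.Random(seed)
    # Synthetic population: (sessions, conversions) per user.
    users = []
    for _ in range(n_users):
        sessions = rng.randint(1, 8)
        propensity = rng.uniform(0.0, 0.1)  # assumed per-user conversion rate
        conversions = sum(rng.random() < propensity for _ in range(sessions))
        users.append((sessions, conversions))

    false_positives = 0
    for _ in range(n_experiments):
        totals = {"a": [0, 0], "b": [0, 0]}  # [conversions, sessions] per group
        for sessions, conversions in users:
            group = "a" if rng.random() < 0.5 else "b"  # randomize by user
            totals[group][0] += conversions
            totals[group][1] += sessions
        # Analyze at the session grain, as in the planned experiment.
        p = two_proportion_p_value(totals["a"][0], totals["a"][1],
                                   totals["b"][0], totals["b"][1])
        false_positives += p < alpha
    return false_positives / n_experiments


false_positive_rate = simulate_false_positive_rate()
```

If the returned rate sits well above `alpha`, the mismatch between randomization unit and analysis unit is hurting you. Swapping the synthetic population for your real per-user session data is what makes the answer meaningful.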

3. Data Can and Should Inform Big and Small Decisions

Data is commonly used to influence big decisions with an obvious need for quantitative evidence. Does this feature positively impact our users? Should we roll it out to everyone? But there’s also a large realm of much smaller decisions that can be equally influenced by data.

Around the time we were planning an experiment to test a new design for our geolocation recommendations, the system responsible for rendering the relevant website content was in the process of being upgraded. The legacy system (“Renderer 1”) was still handling approximately 15 percent of the traffic, while the new system (“Renderer 2”) was handling the other 85 percent. This posed a question to us: do we need to implement our experiment in the two different codebases for each rendering system? Based on the sizable 15 percent still going to “Renderer 1”, our initial thinking was yes. However, we decided to dig a bit deeper.

Flow of web requests to two different content rendering codebases

With our experiment design, we’d only be giving the users the treatment or control experience on the first request in a given session. With that in mind, the question we actually needed to answer changed. Instead of asking what percent of all requests across all users are served by “Renderer 2”, we needed to look at what percent of first requests in a session are served by “Renderer 2” for the users we planned to include in our experiment.

Flow of web requests to two different content rendering codebases after filtering out irrelevant requests

By reframing the problem, we learned that almost all of the relevant requests were being served by the new system, so we were safe to only implement our experiment in one code base.
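
The reframed question translates into a small computation over request logs. The sketch below uses hypothetical field names (`session_id`, `timestamp`, `renderer`) and three toy records that only show the mechanics. They aren't real traffic data.

```python
def renderer_share_of_first_requests(requests, renderer="renderer_2"):
    """Share of sessions whose first request was served by `renderer`."""
    first_by_session = {}
    for req in sorted(requests, key=lambda r: r["timestamp"]):
        # setdefault keeps only the earliest request seen for each session.
        first_by_session.setdefault(req["session_id"], req["renderer"])
    if not first_by_session:
        return 0.0
    hits = sum(1 for served_by in first_by_session.values() if served_by == renderer)
    return hits / len(first_by_session)


toy_requests = [
    {"session_id": "s1", "timestamp": 1, "renderer": "renderer_2"},
    {"session_id": "s1", "timestamp": 2, "renderer": "renderer_1"},
    {"session_id": "s2", "timestamp": 1, "renderer": "renderer_2"},
]
share = renderer_share_of_first_requests(toy_requests)
```

In the toy records, one in three requests goes to the legacy renderer, yet every first request in a session is served by the new one, which is the kind of gap the reframing exposed.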

A key lesson learned from this was that data can and should inform both big and small decisions. Big decisions like “should we roll out this feature to all users,” and small decisions like “should we spend a few days implementing our experiment logic in another codebase.” In this case, two hours of scoping saved at least two days of engineering work, and we learned something useful in the process.

This lesson wasn’t necessarily unique to this experiment, but it’s worth reinforcing. You can only identify these opportunities when you’re working very closely with your cross-discipline counterparts (engineering in this case), attending their standups, and hearing the decisions they’re trying to make. They usually won’t come to you with these questions as they may not think that this is something data can easily or quickly solve.

4, 5, and 6. Understand Your System, Log Generously, and Run More A/A Tests

For an experiment that involved redirecting the treatment group to a different URL, we decided to first run an A/A test to validate that redirects were working as expected and not having a significant impact on our metrics.

The A/A setup looked something like this:

  • A request for a URL comes into the back-end
  • The user, identified using a cookie, is assigned to control/treatment
    • The user and their assigned group are asynchronously logged to Kafka
  • If the user is in the control group, they receive the rendered content (HTML, etc.) they requested
  • If the user is in the treatment group, the server instead responds with a 302 redirect to the same URL
  • This causes the user in the treatment group to make another request for the same URL
    • This time, the server responds with the rendered content originally requested (a cookie is set in the previous step to prevent the user from being redirected again)
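
The steps above can be sketched as a single request handler. This is a simplified illustration, not the production code: the cookie names, the hash-based assignment, and the dict standing in for the Kafka log are all assumptions.

```python
import hashlib


def assign_group(user_id):
    """Stable 50/50 assignment based on the user's cookie identifier."""
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    return "treatment" if int(digest, 16) % 2 else "control"


def handle_request(url, cookies, assignment_log):
    """Return (status, headers, body) for one request in the A/A test."""
    user_id = cookies["user_id"]
    group = assign_group(user_id)
    assignment_log[user_id] = group  # stands in for the asynchronous Kafka write
    if group == "treatment" and cookies.get("redirected") != "1":
        # Redirect once to the same URL and set a cookie so it is not repeated.
        return 302, {"Location": url, "Set-Cookie": "redirected=1"}, ""
    return 200, {}, f"<html>content for {url}</html>"
```

A treatment user therefore makes two requests for the same URL: the first gets the 302 and the cookie, and the second gets the rendered content.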

This may look like a lot, but for users this is virtually invisible. You’d only know if you were redirected if you opened your browser developer tools (under the “Network” tab you’ll see a request with a 302 status).

A/A experiment set up with a redirect to the same page

Shortly into the experiment, I encountered my first instance of sample ratio mismatch (SRM). SRM is when the number of subjects in each group doesn’t match your expectations.

After “inner joining” the assigned users to our sessions system of record, we were seeing a slightly lower fraction of users in the test group compared to the control group instead of the desired 50/50 split.
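
A quick way to judge whether a split like this is more than noise is a chi-square goodness-of-fit test against the expected ratio. The sketch below is a generic check, not the analysis from this experiment, and the user counts in the usage lines are hypothetical.

```python
import math


def srm_p_value(control_users, treatment_users, expected_treatment_share=0.5):
    """Chi-square goodness-of-fit p-value (1 degree of freedom) for the observed split."""
    total = control_users + treatment_users
    expected_treatment = total * expected_treatment_share
    expected_control = total - expected_treatment
    chi2 = ((treatment_users - expected_treatment) ** 2 / expected_treatment
            + (control_users - expected_control) ** 2 / expected_control)
    # With 1 degree of freedom the chi-square survival function is erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2))


balanced = srm_p_value(control_users=5000, treatment_users=5000)
skewed = srm_p_value(control_users=5200, treatment_users=4800)
```

A very small p-value means the imbalance is unlikely to be chance, so the pipeline that produces the groups deserves a closer look before any metric is trusted.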

We asked ourselves why this could be happening. And in order to answer that question, we needed to understand how our system worked. In particular, how do records appear in the sessions data model, and what could be causing fewer records from our test group to appear in there?

Sample ratio mismatch in an A/A experiment with a redirect to the same page

After digging through the source code of the sessions data model, I learned that it’s built by aggregating a series of pageview events. These pageview events are emitted client side, which means that the “user” needs to download the HTML and JavaScript content our servers return, and only then will they emit the pageview events to Kafka.

With this understanding in place, I now knew that some users in our test group were likely dropping off after the redirect and consequently not emitting the pageview events.

Data flows for an A/A experiment with a redirect to the same page

To better understand why this was happening, we added some new server-side logging for each request to capture some key metadata. Our main hypothesis was that this was being caused by bots, since they may not be coded to follow redirects. Using this new logging, I tried removing bots by filtering out different user agents and requests coming from certain IP addresses. This helped reduce the degree of SRM, but didn’t entirely remove it. It’s likely that I wasn’t removing all bots (as they’re notoriously hard to identify) or there were potentially some real users (humans) who were dropping off in the test group. Based on these results, I ended up changing the data sources used to compute our success metric and segment our users.

Despite the major head scratching this caused, I walked away with some really important lessons: 

  • Develop a deep understanding of your system. By truly understanding how redirects and our sessions data model worked, we were able to understand why we were seeing SRM and come up with alternatives to get rid of it.
  • Log generously. Our data platform team made it incredibly simple and low effort to add new Kafka instrumentation, so we took advantage. The new request logging we initially added for investigative purposes ended up being used in the final metrics.
  • Run more A/A tests. By running the A/A test, I was able to identify the sample ratio mismatch issues and update our metrics and data sources prior to running the final experiment. We also learned the impact of redirection alone, which helped us interpret the final results of the eventual A/B test where we redirected to a different URL.

7 and 8. Beware of User Skew and Don’t Be Afraid to Peek

In one experiment where we were testing the impact of translating content into a buyer’s preferred language, I was constantly peeking at the results each day as I was particularly interested in the outcome. The difference in the success metric between the treatment and control groups had been holding steady for well over a week, until it took a nose dive in the last few days of the experiment.

Towards the end of the experiment, the results suddenly changed due to abnormal activity

After digging into the data, I found that this change was entirely driven by a single store with abnormal activity and very high volumes, causing it to heavily influence the overall result. This served as a pleasant reminder to beware of user skew. With any rate based metric, your results can easily be dominated by a set of high volume users (or in this case, a single high volume store).

And despite the warnings you’ll hear, don’t be afraid to peek. I encourage you to look at your results throughout the course of the experiment. You avoid the peeking problem by following a strict experiment plan and collecting a predetermined sample size (that is, don’t get excited by the results and end the experiment early). Peeking at the results each day allowed me to spot the sudden change in our metrics and subsequently identify and remove the offending outlier.

9. Go Down Rabbit Holes

In another experiment involving redirects, I was once again experiencing SRM. There was a higher than expected number of sessions in one group. In past experiments, similar SRM issues were found to be caused by bots not following redirects or weird behaviour with certain browsers.

I was ready to chalk up this SRM to the same causes and call it a day, but there was some evidence that hinted something else may be at play. As a result, I ended up going down a big rabbit hole. The rabbit hole eventually led me to review the application code and our experiment qualification logic. I learned that users in one group had all their returning sessions disqualified from the experiment due to a cookie that was set in their first session.

For an ecommerce experiment, this has significant implications since returning users (buyers) are much more likely to purchase. It’s not a fair comparison if one group contains all sessions, and the other only contains buyers’ first sessions. The results of the experiment changed from negative to positive overall after switching the analysis unit from session to user so that all of a user’s sessions were considered.

Another important lesson learned: go down rabbit holes. In this case, the additional investigation turned out to be incredibly valuable as the entire outcome of the experiment changed after discovering the key segment that was inadvertently excluded. The outcome of a rabbit hole investigation may not always be this rewarding, but at minimum you’ll learn something you can keep in your cognitive backpack.

10. Remember, We’re Measuring Averages

Oftentimes it may be tempting to look at your overall experiment results across all segments and call it a day. Your experiment is positive overall and you want to move on and roll out the feature. This is a dangerous practice, as you can miss some really important insights.

Example experiment results across different segments

As we report results across all segments, it’s important to remember that we’re measuring averages. Positive overall doesn’t mean positive for everyone and vice versa. Always slice your results across key segments and look at the results. This can identify key issues like a certain browser or device where your design doesn’t work, or a buyer demographic that’s highly sensitive to the changes. These insights are just as important as the overall result, as they can drive product changes or decisions to mitigate these effects.

Going Forward…

So as you run more experiments remember:

  1. Think carefully when choosing your randomization unit
  2. Simulations are your friend
  3. Data can and should inform both big and small decisions
  4. Understand your system
  5. Log generously
  6. Run more A/A tests
  7. Beware of user skew
  8. Don’t be afraid to peek
  9. Go down rabbit holes
  10. Remember, we’re measuring averages

I certainly will.

Ian Whitestone: Ian joined Shopify in 2019 and currently leads a data science team working to simplify cross border commerce and taxes. Connect with Ian on LinkedIn to hear about work opportunities or just to chat.


If you’re passionate about data discovery and eager to learn more, we’re always hiring! Reach out to us or apply on our careers page.

Querying Strategies for GraphQL Clients

As more clients rely on GraphQL to query data, we witness performance and scalability issues emerging. Queries are getting bigger and slower, and net-new roll-outs are challenging. The web & mobile development teams working on Orders & Fulfillments spent some time exploring and documenting our approaches. On mobile, our goal was to consistently achieve a sub one second page load on a reliable network. After two years of scaling up our Order screen in terms of features, it was time to re-think the foundation on which we were operating to achieve our goal. We ran a few experiments in mobile and web clients to develop strategies around those pain points. These strategies are still a very open conversation internally, but we wanted to share what we’ve learned and encourage more developers to play with GraphQL at scale in their web and mobile clients. In this post, I’ll go through some of those strategies based on an example query and build upon it to scale it up.

1. Designing Our Base Query

Let’s take the case of a client loading a list of products. To power our list screen we use the following query:

Using this query, we can load the first 100 products and their details (name, price, and image). This might work great, as long as we have fewer than 100 products. As our app grows we need to consider scalability:

  • How can we prepare for the transition to a paginated list?
  • How can we roll out experiments and new features?
  • How can we make this query faster as it grows?

2. Loading Multiple Product Pages

Good news: our products endpoint is paginated on Shopify’s back end, and we can now implement the change on our clients! The main concern on the client side is to find the right page size because it could also have UX and Product implications. The right page size will likely change from one platform to another because we’re likely to display fewer products at the same time on the mobile client (due to less space). This weighs on performance as the query grows.

In this step, a good strategy is to set performance tripwires, that is, to create some kind of score (based on loading times) to monitor our paginated query. Implementing pagination within our query immediately reduces the load on both the back end and the front end if we opt for a lower number than the initial 100 products:

We add two parameters to control the page size and index. We also need to know if the next page is available to show, hence the hasNextPage field. Now that we have support for an unlimited amount of products in our query, we can focus on how we roll out new fields.
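
On the client, consuming such a paginated query could look like the Python sketch below. The names `first`, `after`, `nodes`, and `endCursor` are assumptions in the style of Relay connections, and `fake_fetch_page` is a stand-in for the real network call so the example runs on its own.

```python
def iter_product_pages(fetch_page, page_size=25):
    """Yield one page of products at a time until hasNextPage is false."""
    cursor = None
    while True:
        page = fetch_page(first=page_size, after=cursor)
        yield page["nodes"]
        if not page["pageInfo"]["hasNextPage"]:
            return
        cursor = page["pageInfo"]["endCursor"]


# Stand-in for the network call, backed by an in-memory catalogue.
CATALOGUE = [f"product-{i}" for i in range(60)]


def fake_fetch_page(first, after=None):
    start = 0 if after is None else int(after)
    end = min(start + first, len(CATALOGUE))
    return {
        "nodes": CATALOGUE[start:end],
        "pageInfo": {"hasNextPage": end < len(CATALOGUE), "endCursor": str(end)},
    }


page_sizes = [len(page) for page in iter_product_pages(fake_fetch_page, page_size=25)]
```

Because the function is a generator, the client only requests the next page when the UI actually asks for it.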

3. Controlling New Field Rollouts

Our product list is growing in terms of features, and we run multiple projects at the same time. To make sure we have control over how changes are rolled out in our ProductsList query, we use the @include and @skip directives to make some of the net-new fields we’re rolling out optional. It looks like this:

In the example above, the description field is hidden behind the $featureFlag parameter. It becomes optional, and you need to unwrap its value when parsing the response. If the value of $featureFlag is false, the field is left out of the response, which typed clients typically surface as null.

The @include and @skip directives require any new field to keep the same name and level, as renaming or deleting those fields will likely break the query. A way around this problem is to dynamically build the query at runtime based on the feature flag value.
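
Building the query at runtime can be as simple as assembling the field list from the flag. The sketch below is a hypothetical illustration: the query shape and the field names other than description are assumptions, not the actual ProductsList query.

```python
def build_products_query(description_enabled):
    """Return the query text, adding flagged fields only when their flag is on."""
    fields = ["id", "name", "price"]
    if description_enabled:
        fields.append("description")  # net-new field behind the feature flag
    field_block = "\n      ".join(fields)
    return (
        "query ProductsList($first: Int!) {\n"
        "  products(first: $first) {\n"
        "    nodes {\n"
        f"      {field_block}\n"
        "    }\n"
        "  }\n"
        "}"
    )


query_with_flag = build_products_query(description_enabled=True)
query_without_flag = build_products_query(description_enabled=False)
```

The trade-off is that the query no longer exists as a static document, so tooling that validates or generates types from static queries can't see it.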

Other rollout strategies can involve duplicating queries and running a specific query based on feature flags or working off a side branch until rollout and deployment. Those strategies are likely project and platform specific and come with more trade-offs like complexity, redundant code, and scalability.

The @include and @skip tags solution is handy for local flags, but what about conditional loading based on remote flags? Let’s have a look at chained queries!

4. Chaining Queries

From time to time you’ll need to chain multiple queries. A few scenarios where this might happen are:

  • Your query relies on a remote flag that comes from another query. This makes rolling out features easier as you control the feature release remotely. On mobile clients with many versions in production, this is useful.
  • A part of your query relies on a remote parameter. Similar to the scenario above, you need the value of a remote parameter to power your field. This is usually tied to back-end limitations.
  • You’re running into pagination limitations with your UX. You need to load all pages on screen load and chain your queries until you reach the last page. This mostly happens in clients where the current UX doesn’t allow for pagination and is out of sync with the back-end updates. In this specific case solve the problem at a UX level if possible.

We transform our local feature flag into a remote flag and this is what our query looks like:
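A sketch of that chain using Python's asyncio. The two fetch functions are hypothetical stand-ins for real network calls, the short sleeps stand in for latency, and `call_log` exists only to make the ordering visible:

```python
import asyncio
from typing import Any

call_log: list[str] = []  # records the order in which queries start and finish


async def fetch_remote_description_flag() -> dict[str, Any]:
    """Stand-in for the RemoteDescriptionFlag query."""
    call_log.append("flag:start")
    await asyncio.sleep(0.01)  # simulated network latency
    call_log.append("flag:done")
    return {"remoteFlag": True}  # descriptionEnabled aliased to remoteFlag


async def fetch_products_list(feature_flag: bool) -> list[dict[str, Any]]:
    """Stand-in for the ProductsList query; the flag drives @include."""
    call_log.append("products:start")
    await asyncio.sleep(0.01)
    product: dict[str, Any] = {"id": 1, "title": "Sneaker"}
    if feature_flag:
        product["description"] = "A comfy shoe"
    call_log.append("products:done")
    return [product]


async def load_screen() -> list[dict[str, Any]]:
    """Chain the queries: the second cannot start until the first returns."""
    flag = await fetch_remote_description_flag()
    return await fetch_products_list(flag["remoteFlag"])


products = asyncio.run(load_screen())
```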

In the example above, the RemoteDescriptionFlag query is executed first, and we wait for its results to start the ProductsList query. The descriptionEnabled (aliased to remoteFlag) powers the @include inside our ProductsList query. This means we’re now waiting for two queries at every page or screen load to complete before we can display our list of products. It significantly slows down our performance. A way to work around this scenario is to move the remote flag query outside of this context, probably at an app-wide level.

The TL;DR of chained queries: only do it if necessary.

 

5. Using Parallel Queries

Our products list query is growing significantly with new features:

We added search filters, user permissions, and banners. Those three parts aren’t tied to the products list pagination: if they were included in the ProductsList query, we’d have to re-query those three endpoints every time we ask for a new page. That slows down performance and returns redundant information. It doesn’t scale well with new features and endpoints, so this sounds like a good time to leverage parallel querying!

Parallel querying is exactly what it sounds like: running multiple queries at the same time. Splitting the query into scalable parts and leaving aside the “core” query of the screen brings several benefits to our client:

  • Faster screen load: since we’re querying those endpoints separately, the load is transferred to the back-end side instead. Fragments are resolved and queried simultaneously instead of being queued on the server-side. It’s also easier to scale server-side than client-side in this scenario.
  • Easier to contribute as the team grows: by having one endpoint per query, we diminish the risk of code conflict (for example, fixtures) and flag overlapping for new features. It also makes it easier to remove some endpoints.
  • Easier to introduce the possibility of incremental and partial rendering: As queries are completed, you can start to render content to create the illusion of a faster page load for users.
  • Removes the redundant querying by leaving our paginated endpoint in its own query: we only query for product pages after the initial query cycle.

Here’s an example of what our parallel queries look like:
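A minimal sketch of the idea using Python's asyncio. The query names other than ProductsList and the `fetch` helper are hypothetical, and the sleeps stand in for network latency:

```python
import asyncio

call_log: list[str] = []  # records when each query starts and finishes


async def fetch(name: str, delay: float) -> dict[str, str]:
    """Stand-in for one independent GraphQL query."""
    call_log.append(f"{name}:start")
    await asyncio.sleep(delay)  # simulated network latency
    call_log.append(f"{name}:done")
    return {"query": name}


async def load_screen() -> list[dict[str, str]]:
    """Start every query at once and wait for all of them to finish."""
    return await asyncio.gather(
        fetch("ProductsList", 0.03),
        fetch("SearchFilters", 0.01),
        fetch("UserPermissions", 0.01),
        fetch("Banners", 0.01),
    )


results = asyncio.run(load_screen())
```

All four queries start before any of them completes, so the screen waits roughly as long as the slowest query rather than the sum of all of them. Later pages only need to call the ProductsList query again.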

Whenever one of those queries becomes too big, we apply the same principles and split again to accommodate logic and performance. What’s too big? As a client developer, it’s up to you to answer this question by setting up goals and tripwires. Creating some kind of trackable score for loading time can help you decide when to cut the query into multiple parts. This way the GraphQL growth in our products list is more organic (an outcome that looks at scalability and developer happiness) and doesn't impact performance: each query can grow independently, which reduces the number of potential rollout and code merge conflicts.

A warning when using parallel queries: because you’re transferring the load server-side, make sure you set tripwires to avoid overloading your server. Consult with site reliability experts (SREs or, at Shopify, production engineers) and back-end developers. They can help monitor server-side performance when using parallel querying.

Another challenge tied to parallel queries is plugging the partial data responses into the screen’s state. This is likely to require some refactoring of the existing implementation. It could be a good opportunity to support partial rendering at the same time.

Over the past four years, I have worked on shipping and scaling features in the Orders mobile space at Shopify. Being at the core of our merchants’ workflow gave me the opportunity to develop empathy for their daily work. Empowering them with better features meant that we had to scale our solutions. I have been using those patterns to achieve that, and I’m still discovering new ones. I love how flexible GraphQL is on the client-side! I hope you’ll use some of these querying tricks in your own apps. If you do, please reach out to us. We want to hear how you scale your GraphQL queries!

Additional Information on GraphQL

Théo Ben Hassen is a development manager on the Merchandising mobile team. His focus is on enabling mobile to reach its maximum impact through mobile capabilities. He's interested in anything related to mobile analytics and user experience.


Deleting the Undeletable

At Shopify, we analyze a variety of events from our buyers and merchants to improve their experience and the platform, and empower their decision making. These events are collected via our streaming platform (Kafka) and stored in our data warehouse at a rate of tens of billions of events per day. Historically, these events were collected by a variety of teams in an ad-hoc manner and lacked any guaranteed structure, leading to a variety of usability and maintainability issues, and difficulties with fulfilling data subject rights requests. The image below depicts how these events were collected, stored in our data warehouse, and used in other online dashboards.

An animated gif showing how events were collected, stored in the data warehouse, and used in dashboards. The image represents Analytical Events as three yellow envelopes on the left hand side of the image. The Kafka pipeline in the centre of the image is represented by a blue cylindrical shape. The data warehouse, on the right hand side of the image, is represented by a set of six grey circles stacked on top of each other. Below the data warehouse is a computer screen which represents the dashboards. The animation shows yellow envelopes passing through the Kafka pipeline and continuing to the data warehouse for sign up events or to the dashboard for POS transactions.
How events were collected, stored in our data warehouse, and used in other online dashboards in the old system.

Some of these events contained Personally Identifiable Information (PII). To comply with regulations such as the European General Data Protection Regulation (GDPR), we needed to find a data subject’s PII within our data warehouse and access or delete it (via privacy requests) upon request in a timely manner. This quickly escalated into a very challenging task due to:

  • Lack of guaranteed structure and ownership: Most of these events were only meaningful to and parsable by their creators and didn’t have a fixed schema. Further, there was no easy way to figure out ownership of all of them. Hence, it was near impossible to automatically parse and search these events, let alone access and delete PII within them.

  • Missing data subject context: Even knowing where PII resided in this dataset isn’t enough to fulfill a privacy request. We needed a reliable way to know to whom this PII belongs and who is the data controller. For example, we act as a processor for our merchants when they collect customer data, and so we are only able to process customer deletion requests when instructed by the merchant (the controller of that personal data).

  • Scale: The size of the dataset (on the order of petabytes) made any full search difficult, costly, and time consuming. In addition, it continuously grows by billions of events per day. Hence, any solution needs to be highly scalable to keep up with incoming online events as well as process historic ones.

  • Missing dependency graph: Some of these events and datasets power critical tasks and jobs. Any disruption or change to them can severely affect our operations. However, due to lack of ownership and missing lineage information readily and easily available for each event group, it was hard to determine the full scope of disruption should any change to a dataset happen.

So we were left with finding a needle in an ever growing haystack. These challenges, as well as other maintainability and usability issues with this platform, brought up a golden opportunity for the Privacy team and the Data Science & Engineering team to collaborate and address them together. The rest of this blog post focuses on our collaboration efforts and the technical challenges we faced when addressing these issues in a large organization such as Shopify.

Context Collection

Lack of guaranteed schemas for events was the root cause of a lot of our challenges. To address this, we designed a schematization system that specifies the structure of each event, including the type of each field, evolution (versions) context, ownership, and privacy context. The privacy context specifically includes marking sensitive data, identifying data subjects, and handling PII (that is, what to do with PII).

Schemas are designed by data scientists or developers interested in capturing a new kind of event (or changing an existing one). They’re proposed in a human readable JSON format and then reviewed by team members for accuracy and privacy reasons. As of today, we have more than 4500 active schemas. This schema information is then used to enforce and guarantee the structure of every single event going through our pipeline at generation time.

The schema above is a trimmed signup event schema. Let’s read through it and see what we learn:

The privacy_setting section specifies whose PII this event includes by defining a data controller and a data subject. The data controller is the entity that decides why and how personal data is processed (Shopify in this example). The data subject is the person whose data is being processed, tracked in this schema via that person’s email. It’s worth mentioning that, generally, when we deal with buyer data, merchants are the data controller and Shopify plays the data processor role (a third party that processes personal data on behalf of a data controller).

Every field in a schema has a data-type and doc field, and a privacy block indicating if it contains sensitive data. The privacy block indicates what kind of PII is being collected under this field and how to handle that PII.
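To make that structure concrete, here is a hypothetical schema expressed as a Python dictionary, plus two small helpers. Only privacy_setting, the controller and subject roles, and the per-field type, doc, and privacy block come from the description above. Every key name, value, and helper function is an assumption for illustration and not our actual schema format:

```python
from typing import Any

# Hypothetical schema layout (see the caveats in the lead-in).
SIGNUP_SCHEMA: dict[str, Any] = {
    "name": "signup",
    "version": 1,
    "owner": "growth-team",
    "privacy_setting": {"data_controller": "shopify", "data_subject": "email"},
    "fields": {
        "email": {
            "type": "string",
            "doc": "Email used to sign up",
            "privacy": {"pii_type": "email", "handling": "tokenize"},
        },
        "ip_address": {
            "type": "string",
            "doc": "IP address of the request",
            "privacy": {"pii_type": "ip_address", "handling": "obfuscate"},
        },
        "plan": {"type": "string", "doc": "Selected plan", "privacy": None},
    },
}

PYTHON_TYPES = {"string": str, "integer": int}


def pii_fields(schema: dict[str, Any]) -> dict[str, str]:
    """Map each sensitive field name to the way its PII should be handled."""
    return {
        name: spec["privacy"]["handling"]
        for name, spec in schema["fields"].items()
        if spec.get("privacy")
    }


def validate(event: dict[str, Any], schema: dict[str, Any]) -> bool:
    """Accept an event only if it has exactly the schema's fields and types."""
    fields = schema["fields"]
    return set(event) == set(fields) and all(
        isinstance(event[name], PYTHON_TYPES[spec["type"]])
        for name, spec in fields.items()
    )
```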

Our new schematization platform was successful in capturing the aforementioned context and it significantly increased privacy education and awareness among our data scientists and developers because of discussions on schema proposals about identifying personal data fields. In the vast majority of cases, the proposed schema contained all the proper context, but when required, or in doubt, privacy advice was available. This exemplified that when given accurate and simple tooling, everyone is inclined to do the right thing and respect privacy. Lastly, this platform helped with reusability, observability, and streamlining common tasks for the data scientists too. Our schematization platform signified the importance of capitalizing on shared goals across different teams in a large organization.

Personal Data Handling

At this point, we have flashy schemas that gather all the context we need regarding structure, ownership, and privacy for our analytical events. However, we still haven’t addressed the problem of how to handle and track personal information accurately in our data warehouse. In other words, after having received a deletion or access privacy request, how do we fetch and remove PII from our data warehouse?

The short answer: we won’t store any PII in our data warehouse. To facilitate this, we perform two types of transformation on personal data before entering our data warehouse. These transformations convert personal (identifying) data to non-personal (non-identifying) data, hence there’s no need to remove or report them anymore. It sounds counterintuitive since it seems data might be rendered useless at this point. We preserve analytical knowledge without storing raw personal data through what GDPR calls pseudonymisation, “the processing of personal data in such a manner that the personal data can no longer be attributed to a specific data subject without the use of additional information”. In particular, we employ two types of pseudonymisation techniques: Obfuscation and Tokenization.

It’s important to stress that personal data that’s undergone pseudonymisation and could be attributed to a natural person by the use of additional information, directly or indirectly, is still considered personal information under GDPR and requires proper safeguards. Hence, when we said we won’t store any PII in our data warehouse, it wasn’t entirely precise. However, pseudonymisation allows us to control personal data, reduce risk, and truly anonymize or remove PII when requested.

Obfuscation and Enrichment

In obfuscation, identifying parts of data are either masked or removed so the people whom the data describe remain anonymous. A crucial point, however, is to preserve the analytical value of these records so that they stay useful. Our obfuscation operators don’t just remove identifying information, they also enrich data with non-personal data. This often removes the need for storing personal data at all.

For example, when we obfuscate an IP address, we mask half of the bytes but include geolocation data at city and country level. In most cases, this is what the raw IP address was intended for in the first place. This had a big impact on adoption of our new platform, and in some cases offered added value too.

Looking at different types of PII and how they’re used, we quickly observed patterns. For instance, the main use case of a full user agent string is to determine operating system, device type, and major version that are shared among many users. But a user agent can contain very detailed identifying information including screen resolution, fonts installed, clock skew, and other bits that can identify a data subject, hence they’re considered PII. So, during obfuscation, all identifying bits are removed and replaced with generalized aggregate level data that data analysts seek. The table below shows some examples of different PII types and how they’re obfuscated and enriched.

PII Type | Raw Form | Obfuscated
IP Address | 207.164.33.12 | { "masked": "207.164.0.0", "geo_country": "Canada" }
User agent | CPU iPhone OS 9_3_2 like Mac OS X) AppleWebKit/601.1.46 (KHTML, like Gecko) Mobile/13F69 Instagram 8.4.0 (iPhone7,2; iPhone OS 9_3_2; nb_NO; nb-NO; scale=2.00; 750x1334 | { "Family": "Instagram", "Major": "8", "Os.Family": "iOS", "Os.Major": "9", "Device.Brand": "Apple", "Device.Model": "iPhone7" }
Latitude/Longitude | 45.4215° N, 75.6972° W | 45.4° N, 75.6° W
Email | john@gmail.com | REDACTED@gmail.com
Email | behrooz@example.com | REDACTED@REDACTED.com
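The rows above can be reproduced with a few small operators. This is a sketch in Python and not our production code: the function names are hypothetical, the rule that keeps only well-known email domains is an assumption inferred from the two email rows, and the geolocation enrichment is left out because it needs a GeoIP database.

```python
import ipaddress
import math

# Assumed allow-list of mail domains common enough to keep.
COMMON_EMAIL_DOMAINS = {"gmail.com", "hotmail.com"}


def mask_ip(raw_ip: str) -> str:
    """Zero the lower half of an IPv4 address, keeping the first two bytes."""
    network = ipaddress.ip_network(f"{raw_ip}/16", strict=False)
    return str(network.network_address)


def coarsen_coordinate(degrees: float) -> float:
    """Truncate a coordinate to one decimal place."""
    return math.trunc(degrees * 10) / 10


def redact_email(email: str) -> str:
    """Drop the local part, and the domain too unless it is a common one."""
    _, _, domain = email.strip().rpartition("@")
    if domain.lower() in COMMON_EMAIL_DOMAINS:
        return f"REDACTED@{domain}"
    top_level = domain.rsplit(".", 1)[-1]
    return f"REDACTED@REDACTED.{top_level}"
```

For example, `mask_ip("207.164.33.12")` gives `"207.164.0.0"` and `redact_email("behrooz@example.com")` gives `"REDACTED@REDACTED.com"`, matching the table.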

A keen observer might realize some of the obfuscated data might still be unique enough to identify individuals. For instance, when a new device like an iPhone is released, only a few people might own it, which can lead to identification, especially when combined with other obfuscated data. To address these limitations, we hold a list of allowed devices, families, or versions that we’re certain have enough unique instances (more than a set threshold) and gradually add to this list (as more unique individuals become part of that group). It’s important to note that this still isn’t perfect anonymization, and it’s possible for an attacker to combine enough anonymized and other data to identify an individual. However, that risk and threat model isn’t as significant within an organization where access to PII is more easily available.
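The allow-list idea can be sketched as follows. The threshold value, the `"Other"` fallback label, and both function names are assumptions for illustration; a real threshold would be far higher:

```python
from collections import defaultdict

THRESHOLD = 2  # assumed value, chosen only to keep the example small


def build_allow_list(observations: list[tuple[str, str]]) -> set[str]:
    """Return device models seen for more than THRESHOLD distinct users."""
    users_by_model: dict[str, set[str]] = defaultdict(set)
    for user_id, model in observations:
        users_by_model[model].add(user_id)
    return {
        model for model, users in users_by_model.items() if len(users) > THRESHOLD
    }


def generalize_model(model: str, allow_list: set[str]) -> str:
    """Keep a device model only if it is on the allow-list."""
    return model if model in allow_list else "Other"
```

Rebuilding the allow-list periodically lets a newly released device graduate from `"Other"` once enough distinct people use it.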

Tokenization

Obfuscation is irreversible (the original PII is gone forever) and doesn’t suit every use case. There are times when data scientists require access to the actual raw values of PII (imagine preparing a list of emails to send a promotional newsletter). To address these needs, we built a tokenization engine that exchanges PII for a consistent random token. We then store only tokens in the data warehouse and not the raw PII. A separate secured vault service is in charge of storing the token to PII mapping. This way, if there’s a delete request, only the mapping in the vault service needs removing, and all the copies of the corresponding token across the data warehouse become effectively non-detokenizable (in other words, just a random string).

To understand the tokenization process better, let’s go through an example. Let’s say Hooman is a big fan of Allbirds and Gymshark products, and he purchases two pairs of shoes from Allbirds and a pair of shorts from Gymshark to hit the trails! His purchase data might look like the table below before tokenization:

Email | Shop | Product | ...
hooman@gmail.com | allbirds | Sneaker | ...
hooman@gmail.com | Gymshark | Shorts | ...
hooman@gmail.com | allbirds | Running Shoes | ...

After tokenization is applied, the table above looks like the table below:

Email | Shop | Product | ...
Token123 | allbirds | Sneaker | ...
Token456 | Gymshark | Shorts | ...
Token123 | allbirds | Running Shoes | ...

There are two important observations in the table after tokenization:

  1. The same PII (hooman@gmail.com) was replaced by the same token (Token123) under the same data controller (allbirds shop) and data subject (Hooman). This is the consistency property of tokens.
  2. On the other hand, the same PII (hooman@gmail.com) got a different token (Token456) under a different data controller (the Gymshark shop) even though the actual PII remained the same. This is the multi-controller property of tokens, and it allows data subjects to exercise their rights independently among different data controllers (merchant shops). For instance, if Hooman wants to be forgotten or deleted from allbirds, that shouldn’t affect his history with Gymshark.

Now let’s take a look at how all of this information is stored within our tokenization vault service, shown in the table below.

Data Subject | Controller | Token | PII
hooman@gmail.com | allbirds | Token123 | hooman@gmail.com
hooman@gmail.com | Gymshark | Token456 | hooman@gmail.com
... | ... | ... | ...

The vault service holds the token to PII mapping and the privacy context, including data controller and subject. It uses this context to decide whether to generate a new token for the given PII or reuse the existing one. The consistency property of tokens allows data scientists to perform analysis without requiring access to the raw value of PII. For example, all of Hooman’s orders from Gymshark could be tracked only by looking for Token456 across the tokenized orders dataset.
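The reuse-or-generate decision can be sketched with an in-memory vault. The `TokenVault` class, its method names, and the token format are hypothetical, and a real vault would be a secured, persistent service:

```python
import secrets
from typing import Optional


class TokenVault:
    """In-memory sketch of a token <-> PII mapping keyed by privacy context."""

    def __init__(self) -> None:
        # (data subject, controller, PII) -> token
        self._tokens: dict[tuple[str, str, str], str] = {}
        # token -> PII, for detokenization
        self._pii: dict[str, str] = {}

    def tokenize(self, subject: str, controller: str, pii: str) -> str:
        """Reuse the token for a known context, otherwise mint a random one."""
        key = (subject, controller, pii)
        if key not in self._tokens:
            token = f"Token_{secrets.token_hex(8)}"
            self._tokens[key] = token
            self._pii[token] = pii
        return self._tokens[key]

    def detokenize(self, token: str) -> Optional[str]:
        """Return the raw PII, or None if no mapping exists for the token."""
        return self._pii.get(token)


vault = TokenVault()
sneaker = vault.tokenize("hooman@gmail.com", "allbirds", "hooman@gmail.com")
shorts = vault.tokenize("hooman@gmail.com", "Gymshark", "hooman@gmail.com")
running = vault.tokenize("hooman@gmail.com", "allbirds", "hooman@gmail.com")
```

Here `sneaker` and `running` are the same token (consistency), while `shorts` differs because the controller differs (multi-controller).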

Now back to our original goal: let’s review how all of this helps with deletion of PII in our data warehouse (reporting and access requests are similar except that, instead of being deleted, the target records are reported back). If we store only obfuscated and tokenized PII in our datasets, there’s essentially nothing left in the data warehouse to delete after removing the mapping from the tokenization vault. To understand this, let’s go through some examples of deletion requests and how they affect our datasets as well as the tokenization vault.

Data Subject | Controller | Token | PII
hooman@gmail.com | allbirds | Token123 | hooman@gmail.com
hooman@gmail.com | Gymshark | Token456 | hooman@gmail.com
hooman@gmail.com | Gymshark | Token789 | 222-333-4444
eva@hotmail.com | Gymshark | Token011 | IP 76.44.55.33

Assume the table above shows the current content of our tokenization vault, and these tokens are stored across our data warehouse in multiple datasets. Now Hooman sends a deletion request to Gymshark (controller) and subsequently Shopify (data processor) receives it. At this point, all that’s required to delete Hooman’s PII under Gymshark is to locate the rows matching the following condition:

DataSubject == 'hooman@gmail.com' AND Controller == 'Gymshark'

Which results in the rows identified with a star (*) in the table below:

Data Subject | Controller | Token | PII
hooman@gmail.com | allbirds | Token123 | hooman@gmail.com
* hooman@gmail.com | Gymshark | Token456 | hooman@gmail.com
* hooman@gmail.com | Gymshark | Token789 | 222-333-4444
eva@hotmail.com | Gymshark | Token011 | IP 76.44.55.33

Similarly, if Shopify needed to delete all of Hooman’s PII across all controllers (shops), it would only need to look for rows that have Hooman as the data subject, marked with a star (*) below:

Data Subject | Controller | Token | PII
* hooman@gmail.com | allbirds | Token123 | hooman@gmail.com
* hooman@gmail.com | Gymshark | Token456 | hooman@gmail.com
* hooman@gmail.com | Gymshark | Token789 | 222-333-4444
eva@hotmail.com | Gymshark | Token011 | IP 76.44.55.33

Last but not least, the same theory applies to merchants too. For instance, assume (let’s hope that will never happen!) Gymshark (data subject) decides to close their shop and asks Shopify (data controller) to delete all PII controlled by them. In this case, we could search with the following condition:

Controller == 'Gymshark'

Which results in the rows marked with a star (*) in the table below:

Data Subject | Controller | Token | PII
hooman@gmail.com | allbirds | Token123 | hooman@gmail.com
* hooman@gmail.com | Gymshark | Token456 | hooman@gmail.com
* hooman@gmail.com | Gymshark | Token789 | 222-333-4444
* eva@hotmail.com | Gymshark | Token011 | IP 76.44.55.33

Notice that in all of these examples, there was nothing to do in the actual data warehouse: once the token ↔ PII mapping is deleted, tokens effectively become consistent random strings. In addition, all of these operations can be done in fractions of a second, whereas any task in a petabyte-scale data warehouse can be very challenging, time consuming, and resource intensive.
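The three deletion scenarios above reduce to one filter over the vault rows. A minimal sketch, using the example rows from the tables and a hypothetical `delete_mappings` helper:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class VaultRow:
    data_subject: str
    controller: str
    token: str
    pii: str


vault = [
    VaultRow("hooman@gmail.com", "allbirds", "Token123", "hooman@gmail.com"),
    VaultRow("hooman@gmail.com", "Gymshark", "Token456", "hooman@gmail.com"),
    VaultRow("hooman@gmail.com", "Gymshark", "Token789", "222-333-4444"),
    VaultRow("eva@hotmail.com", "Gymshark", "Token011", "IP 76.44.55.33"),
]


def delete_mappings(
    rows: list[VaultRow],
    data_subject: Optional[str] = None,
    controller: Optional[str] = None,
) -> tuple[list[VaultRow], list[str]]:
    """Drop rows matching every given condition; return kept rows and dead tokens."""
    if data_subject is None and controller is None:
        raise ValueError("refusing to delete without a condition")

    def matches(row: VaultRow) -> bool:
        return (data_subject is None or row.data_subject == data_subject) and (
            controller is None or row.controller == controller
        )

    kept = [row for row in rows if not matches(row)]
    deleted_tokens = [row.token for row in rows if matches(row)]
    return kept, deleted_tokens
```

Calling `delete_mappings(vault, data_subject="hooman@gmail.com", controller="Gymshark")` removes the Token456 and Token789 rows and leaves the other two untouched. Nothing in the warehouse needs to change.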

Schematization Platform Overview

So far we’ve learned about details of schematization, obfuscation, and tokenization. Now it’s time to put all of these pieces together in our analytical platform. The image below shows an overview of the journey of an event from when it’s fired until it’s stored in the data warehouse:

An animated gif overview of the journey of an event from when it’s fired until it’s stored in the data warehouse. On the left hand side is Analytical Events represented by three yellow envelopes. In the center of the image is a cylindrical object that represents the Schema Repository. An arrow from the Schema Repository points downward to the Kafka pipeline which is represented by a blue cylindrical object. On the right hand side of the image is the Tokenization Vault that is represented by a blue square with a vault lock. Underneath the vault is the data warehouse represented by six grey circles stacked on top of each other.

In this example:

  1. A SignUp event is triggered into the messaging pipeline (Kafka).
  2. A tool, Scrubber, intercepts the message in the pipeline and applies pseudonymisation to the content using the predefined schema fetched from the Schema Repository for that message.
  3. The Scrubber identifies that the SignUp event contains tokenization operations too. It then sends the raw PII and Privacy Context to the Tokenization Vault.
  4. The Tokenization Vault exchanges the PII and Privacy Context for a Token and sends it back to the Scrubber.
  5. The Scrubber replaces the PII in the content of the SignUp event with the Token.
  6. The new anonymized and tokenized SignUp event is put back onto the message pipeline.
  7. The PII-free SignUp event is stored in the data warehouse.
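Steps 2 to 5 can be sketched in a few lines of Python. Everything here is a simplified stand-in: the schema layout, the in-memory `vault` dictionary, and the function names are hypothetical, and the message pipeline itself is left out.

```python
import ipaddress
import secrets
from typing import Any

# Hypothetical schema: which fields carry PII and how to handle each one.
SIGNUP_SCHEMA: dict[str, Any] = {
    "controller": "shopify",
    "subject_field": "email",
    "handling": {"email": "tokenize", "ip_address": "obfuscate"},
}

vault: dict[tuple[str, str, str], str] = {}  # (subject, controller, PII) -> token


def exchange_for_token(subject: str, controller: str, pii: str) -> str:
    """Steps 3-4: the vault returns a consistent token for PII plus context."""
    key = (subject, controller, pii)
    if key not in vault:
        vault[key] = f"Token_{secrets.token_hex(8)}"
    return vault[key]


def obfuscate_ip(raw_ip: str) -> str:
    """Mask the lower half of an IPv4 address."""
    return str(ipaddress.ip_network(f"{raw_ip}/16", strict=False).network_address)


def scrub(event: dict[str, Any], schema: dict[str, Any]) -> dict[str, Any]:
    """Steps 2-5: apply the schema's handling to every PII field of an event."""
    subject = event[schema["subject_field"]]
    scrubbed = dict(event)  # leave the incoming event untouched
    for field, handling in schema["handling"].items():
        if handling == "tokenize":
            scrubbed[field] = exchange_for_token(
                subject, schema["controller"], event[field]
            )
        elif handling == "obfuscate":
            scrubbed[field] = obfuscate_ip(event[field])
    return scrubbed


raw_event = {"email": "hooman@gmail.com", "ip_address": "207.164.33.12", "plan": "basic"}
clean_event = scrub(raw_event, SIGNUP_SCHEMA)
```

Only `clean_event` would continue down the pipeline to the data warehouse. Fields without a handling rule, such as `plan`, pass through unchanged.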

In theory, this schematization platform allows a PII-free data warehouse for all new incoming events; however, in practice, some challenges remain.

Lessons from Managing PII at Shopify Scale

Despite having a sound technical solution for classifying and handling PII in our data warehouse, Shopify scale made adoption and reprocessing of our historic data a difficult task. Here are some lessons that helped us in this journey.

Adoption

Having a solution and adopting it are two different problems. Initially, with a sound prototype ready, we struggled to get approval and commitment from all stakeholders to implement this new tooling, and rightly so. Looking back at all of these proposed changes and tools to an existing platform, it does seem like open heart surgery, and of course, you’d likely face resistance. There’s no bulletproof solution to this problem, or at least none that we knew of! Let’s review a few factors that significantly helped us.

Make the Wrong Thing the Hard Thing

Make the right thing the default option. A big factor in the success and adoption of our tooling was to make our tooling the default and easy option. Nowadays, creating and collecting unstructured analytical events at Shopify is difficult and goes through a tedious process with several layers of approval. Whereas creating structured privacy-aware events is a quick, well documented, and automated task.

“Trust Me, It Will Work” Isn’t Enough!

Proving scalability and accuracy of the proposed tooling was critical to building trust in our approach. We used the same tooling and mechanisms that the Data Science & Engineering team uses to prove correctness and reconciliation. We showed the scalability of our tooling by testing it on real datasets and stress testing it under orders of magnitude higher load.

Make Sure the Tooling Brings Added Value

Our new tooling is not only the default and easy way to collect events, but also offers added value and benefits such as:

  • Streamlined workflow: No need for multiple teams to worry about compliance and fulfilling privacy requests.
  • Increased data enrichment: For instance, geolocation data from IP addresses, and family or device info from user agent strings, is the information that data scientists are often after in the first place.
  • Shared privacy education: Our new schematization platform encourages asking about and discussing privacy concerns. They range from what’s PII to other topics like what can or can’t be done with PII. It brings clarity and education that wasn’t easily available before.
  • Increased dataset discoverability: Schemas for events allow us to automatically integrate with query engines and existing tooling, making datasets quick to be used and explored.

These benefits are a big driver of, and encouragement for, adoption of our new tooling.

Capitalizing on Shared Goals

Schematization isn’t only useful for privacy reasons, it helps with reusability and observability, reduces storage cost, and streamlines common tasks for the data scientists too. Both privacy and data teams are important stakeholders in this project and it made collaboration and adoption a lot easier because we capitalized on shared goals across different teams in a large organization.

Historic Datasets

Historic datasets are several petabytes of historic events collected in our data warehouse prior to the schematization platform. Even after implementing the new platform, the challenge of dealing with large historic datasets remained. What made it formidable was the sheer amount of data for which it was hard to identify an owner, and which was hard to reprocess and migrate without disturbing the production platform. In addition, it’s not the most exciting kind of work, so it easily gets deprioritized.

A dependency graph showing a partial view of interdependency between analytical jobs. The graph is large and has many branches represented by green, pink, black, blue, and white boxes. Numerous black lines terminating with arrows connect the dependencies.
Intricate interdependencies between some of the analytical jobs depending on these datasets

The above image shows a partial view of the intricate interdependency between some of the analytical jobs depending on these datasets. Similar to adoption challenges, there’s no easy solution for this problem, but here are some practices that helped us in mitigating this challenge.

Organizational Alignment

Any task of this scale goes beyond the affected individuals, projects, or even teams. Hence an organizational commitment and alignment is required to get it done. People, teams, priorities, and projects might change, but if there’s organizational support and commitment for addressing privacy issues, the task can survive. Organizational alignment helped us to put out consistent messaging to various team leads that meant everyone understood the importance of the work. With this alignment in place, it was usually just a matter of working with leads to find the right balance of completing their contributions in a timely fashion without completely disrupting their roadmap.

Dedicated Task Force

These kinds of projects are slow and time consuming. In our case, it took over a year, and several changes at individual and team levels happened along the way. We understood the importance of having a dedicated team and project so that we didn’t depend on individuals. People come and go, but the project must carry on.

Tooling, Documentation, and Support

One of our goals was to minimize the amount of effort individual dataset owners and users needed to migrate their datasets to the new platform. We documented the required steps, built automation for tedious tasks, and created integrations with tooling that data scientists and librarians were already familiar with. In addition, having Engineering support with hurdles was important. For instance, on many occasions when performance or other technical issues came up, Engineering support was available to solve the problem. Time spent on building the tooling, documentation, and support procedures easily paid off in the long run.

Regular Progress Monitoring

Questioning dependencies, priorities, and blockers regularly paid off because we found better ways. For instance, in a situation where x is considered a blocker for y maybe:

  • we can ask the team working on x to reprioritize and unblock y earlier.
  • both x and y can happen at the same time if the teams owning them align on some shared design choices.
  • there's a way to reframe x or y or both so that the dependency disappears.

We were able to do this kind of reevaluation because we had regular and constant progress monitoring to identify blockers.

New Platform Operational Statistics

Our new platform has been in production use for over two years. Nowadays, we have over 4500 distinct analytical schemas for events, each designed to capture certain metrics or analytics, and each with its own unique privacy context. On average, these schemas generate roughly 20 billion events per day, or approximately 230K events per second, with peaks of over 1 million events per second during busy times. Every single one of these events is processed by our obfuscation and tokenization tools in accordance with its privacy context before being accessible in the data warehouse or other places.

Our tokenization vault holds more than 500 billion distinct PII to token mappings (approximately 200 terabytes), from which tens to hundreds of millions are deleted daily in response to privacy or shop purge requests. The magical part of this platform is that deletion happens instantaneously only in the tokenization vault without requiring any operation in the data warehouse or any other place where tokens are stored. This is the super power that enables us to delete data that used to be very difficult to identify, the undeletable. These metrics and the ease of fulfilling privacy requests proved the efficiency and scalability of our approach and new tooling.
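As a quick sanity check on these figures, here is the arithmetic in Python, treating the rounded numbers above as exact and assuming decimal terabytes:

```python
SECONDS_PER_DAY = 24 * 60 * 60

events_per_day = 20e9
events_per_second = events_per_day / SECONDS_PER_DAY  # roughly 231,000

vault_bytes = 200e12  # 200 terabytes, decimal units assumed
vault_mappings = 500e9
bytes_per_mapping = vault_bytes / vault_mappings  # 400.0
```

So the average rate rounds to the quoted 230K events per second, and each vault mapping takes on the order of a few hundred bytes.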

As part of onboarding our historic datasets into our new platform, we rebuilt roughly 100 distinct datasets (approximately tens of petabytes of data in total) feeding hundreds of jobs in our analytical platform. Development, rollout, and reprocessing of our historical data altogether took about three years with help from 94 different individuals, which signifies the scale of effort and commitment that we put into this project.

We believe sharing the story of a metamorphosis in our data analytics platform to facilitate privacy requests is valuable because when we looked for industry examples, there were very few available. In our experience, schematization and a platform to capture the context, including privacy and evolution, are beneficial in analytical event collection systems. They enable a variety of opportunities in treating sensitive information and educating developers and data scientists on data privacy. In fact, our adoption story showed that people are highly motivated to respect privacy when they have the right tooling at their disposal.

Tokenization and obfuscation proved to be effective tools in helping with handling, tracking and deletion of personal information. They enabled us to efficiently delete the undeletable at a very large scale.

Finally, we learned that solving technical challenges isn’t the entire problem. It remains a tough problem to address organizational challenges such as adoption and dealing with historic datasets. While we didn’t have a bulletproof solution, we learned that bringing new value, capitalizing on shared goals, streamlining and automating processes, and having a dedicated task force to champion these kinds of big cross-team initiatives are effective and helpful techniques.

Additional Information

Behrooz is a staff privacy engineer at Shopify where he works on building scalable privacy tooling and helps teams to respect privacy. He received his MSc in Computer Science at University of Waterloo in 2015.  Outside of the binary world, he enjoys being upside down (gymnastics) 🤸🏻, on a bike  🚵🏻‍♂️ , on skis ⛷, or in the woods. Twitter: @behroozshafiee


Wherever you are, your next journey starts here! If building systems from the ground up to solve real-world problems interests you, our Engineering blog has stories about other challenges we have encountered. Intrigued? Visit our Engineering career page to find out about our open positions and learn about Digital by Default.


Updating Illustrations at Scale

The Polaris team creates tools, education and documentation that helps unite teams across Shopify to build better experiences. We created Polaris, our design system, and continue to maintain it. We are a multidisciplinary team with a range of experience. Some people have been at Shopify for over 6 years and others, like me, are a part of our Dev Degree program.


Shipit! Presents: How We Write React Native Apps

On May 19, 2021, Shipit!, our monthly event series, presented How We Write React Native Apps. Join Colin Gray, Haris Mahmood, Guil Varandas, and Michelle Fernandez who are the developers setting the React Native standards at Shopify. They’ll share more on how we write performant React Native apps.

 

Q: What best practices can we follow when we’re building an app, like for accessibility, theming, typography, and so on?
A: Our Restyle and Polaris documentation cover a lot of this and are worth reading through, either as a reference or to influence your own decisions on best practices.

Q: How do you usually handle running into crashes or weird bugs that are internal to React Native? In my experience some of these can be pretty mysterious without good knowledge of React Native internals. Pretty often issues on GitHub for some of these "rare" bugs might stall with no solution, so working on a PR for a fix isn't always a choice (after, of course, submitting a well documented issue).
A: We rely on various debugging and observability tools to detect crashes and bug patterns. That being said, running into bugs that are internal to React Native isn’t a scenario that we have to handle often, and if it ever happens, we rely on the technical expertise of our teams to understand it and communicate, or fix it through the existing channels. That’s the beauty of open source!

Q: Do you have any guide on what needs to be flexible and what fixed size... and where to use margin or paddings? 
A: Try to keep most things flexible unless absolutely necessary. This results in your UI being more fluid and adaptable to various devices. We use fixed sizes mostly for icons and imagery. We utilize padding to create spacing within a component and margins to create spacing between components.

Q: Does your team use React Studio?
A: No, but a few native Android developers coming from the IntelliJ suite of editors have set up their IDE so that they can write both React and Kotlin, with code resolution, in one IDE.

Q: Do you write automated tests using protractor/cypress or jest?
A: Jest is our go-to library for writing and running unit and integration tests.

Q: Is Shopify a Brownfield app? If it is, how are you handling navigation with React Native and native?
A: Shop and POS are both React Native from the ground up, but we do have a Brownfield app in the works. We are adding React Native views piecemeal, so navigation is being handled by the existing navigation controllers. Wiring this up is work, no getting around that.

Q: How do you synchronize native (KMM) and React Native state?
A: We try to treat React Native state as the “Source of Truth”. At startup, we pass in whatever is necessary for the module to begin its work, and any shared state is managed in React Native, and updated via the native module (updates from the native module are sent via EventEmitter). This means that the native module is only responsible for its internal state and shared state is kept in React Native. One exception to this in the Point of Sale app is the SQLite database. We access that entirely via a native module. But again there’s only one source of truth.


Q: How do you manage various screen sizes and responsive layouts in React Native? (Polaris or something else)
A: We avoid fixed sizing values whenever possible, which makes our UIs better able to adjust to various device sizes. The Restyle library allows you to define breakpoints and pass in different values for each breakpoint when defining styles. For example, you can pass in different font sizes or spacing values depending on the breakpoints you define.

Q: Are you using Reanimated 2 in production at Shopify?
A: We are! The Shop app uses Reanimated 2 in production today.

Q: What do you use to configure and manage your CI builds?
A: We use Buildkite. Check out these two posts to learn more


Q: In the early stage of your React Native apps did you use Expo, or it was never an option?
A: We explored it, but most of our apps quickly needed to “eject” from that workflow. We eventually decided that we would create our React Native applications as “vanilla” applications. Expo is great though, and we encourage people to use it for their own side projects.

Q: Are the nightly QAs automatic? How is the QA cycle?
A: Nightly builds are created automatically on our main branch. These builds automatically get uploaded to a test distribution platform and the Shopifolk (product managers, designers, developers) who have the test builds installed can opt in to always be updated to the latest version. Thanks to the ShipIt tool, any feature branches with failing tests will never be allowed to be merged to main.

All our devs are responsible for QA of the app and ensuring that no regressions occur before release.

Q: Have you tried Loki?
A: Some teams have tried it, but Loki doesn’t work with our CI constraints.

Learn More About React Native at Shopify




How Shopify Built An In-Context Analytics Experience

Federica Luraschi & Racheal Herlihy 

Whether determining which orders to fulfill first or assessing how their customer acquisition efforts are performing, our merchants rely on data to make informed decisions about their business. In 2013, we made easy access to data a reality for Shopify merchants with the Analytics section. The Analytics section lives within the Shopify admin (where merchants can log in to manage their business) and it gives them access to data that helps them understand how their business is performing.

While the Analytics section is a great way to get data into merchants’ hands, we realized there was an opportunity to bring insights right within their workflows. That’s why we’ve launched a brand new merchant analytics experience that surfaces data in-context within the most visited page in the Shopify admin: the Orders page.

Below, we’ll discuss the motivation behind bringing analytics in-context and the data work that went into launching this new product. We’ll walk you through the exploratory analysis, data modeling, instrumentation, and success metrics work we conducted to bring analytics into our merchants’ workflow.

Moving Towards In-Context Analytics

Currently, almost all the data surfaced to merchants lives in the Analytics section in the Shopify admin. The Analytics section contains:

  • Overview Dashboard: A dashboard that captures a merchant’s business health over time, including dozens of metrics ranging from total sales to online store conversion rates.
  • Reports: Enables merchants to explore their data more deeply and view a wider, more customizable range of metrics.
  • Live View: Gives merchants a real-time view of their store activity by showcasing live metrics and a globe to highlight visitor activity around the world. 

Image: Shopify Analytics section Live View. A globe focused on Canada and the US shows blue and green dots representing visitors and orders, above a dashboard showing Visitors right now, Total sessions, Total sales, Total orders, Page views, and Customer behaviour.


From user research, as well as quantitative analysis, we found merchants were often navigating back and forth between the Analytics section and different pages within the Shopify admin in order to make data-informed decisions. To enable merchants to make informed decisions faster, we decided to bring data to where it is most impactful to them—right within their workflow.

Our goal was to insert analytics into the most used page in the Shopify admin, where merchants view and fulfill their orders—aka the Orders page. Specifically, we wanted to surface real-time and historical metrics that give merchants insight into the health of their orders and fulfillment workflows.

Step 1: Choosing (and Validating) Our Metrics

First, we collaborated with product managers to look at merchant workflows on the Orders page (for example, how merchants fulfill their orders) and, just as important, the associated goals the merchant would have for that workflow (for example, reducing the time required for fulfillment). We compared these to the available data to identify:

  • The top-level metrics (for example, median fulfillment time)
  • The dimensions for the metrics (for example, location)
  • Complementary visualizations or data points to the top-level metrics that we would surface in reports (for example, the distribution of the fulfillment time)

For every metric we identified, we worked through specific use cases to understand how a merchant would use the metric as part of their workflow. We wanted to ensure that seeing a metric would push merchants to a follow-up action or give them a certain signal. For example, if a merchant observes that their median fulfillment time for the month is higher than the previous month, they could explore this further by looking at the data over time to understand if this is a trend. This process validated that the metrics being surfaced would actually improve merchant workflows by providing them actionable data.
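As an illustration of a top-level metric with one dimension, the sketch below computes median fulfillment time by location. The order records and field names (`location`, `ordered_at`, `fulfilled_at`) are hypothetical sample data, not the real schema.

```python
from collections import defaultdict
from datetime import datetime
from statistics import median

# Hypothetical order records; field names are illustrative only.
orders = [
    {"location": "Toronto", "ordered_at": datetime(2021, 5, 1, 9), "fulfilled_at": datetime(2021, 5, 1, 15)},
    {"location": "Toronto", "ordered_at": datetime(2021, 5, 2, 9), "fulfilled_at": datetime(2021, 5, 2, 19)},
    {"location": "Toronto", "ordered_at": datetime(2021, 5, 3, 9), "fulfilled_at": datetime(2021, 5, 3, 17)},
    {"location": "Ottawa", "ordered_at": datetime(2021, 5, 1, 9), "fulfilled_at": datetime(2021, 5, 2, 9)},
    {"location": "Ottawa", "ordered_at": datetime(2021, 5, 2, 9), "fulfilled_at": datetime(2021, 5, 4, 9)},
]


def median_fulfillment_hours_by_location(orders):
    """Group fulfillment durations (in hours) by location and take the median of each group."""
    hours_by_location = defaultdict(list)
    for order in orders:
        duration = order["fulfilled_at"] - order["ordered_at"]
        hours_by_location[order["location"]].append(duration.total_seconds() / 3600)
    return {location: median(hours) for location, hours in hours_by_location.items()}


medians = median_fulfillment_hours_by_location(orders)
print(medians)
```

Running the same function over two consecutive months would give the month-over-month comparison described above.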

Step 2: Understanding the Shape of the Data

Shopify has over 1.7 million merchants, all at different stages of their entrepreneurial journey. In order to build the best analytics experience, it was important for our team to have an understanding of what the data would look like for different merchant segments. Some of the ways we segmented our merchants were by:

  • Order volumes (for example, low, medium, or high volume stores)
  • The stage of their entrepreneurial journey (for example, stores that just started making sales to stores that have been running their business for years)
  • Different fulfillment and delivery processes (for example, merchants that deliver orders themselves to merchants that use shipping carriers)
  • Geographic region
  • Industry

After segmenting our merchants, we looked at the “shape of the data” for all the metrics we wanted to surface. More specifically, it was important for us to answer the following questions for each metric:

  • How many merchants would find this metric useful?
  • What merchant segments would find it most useful?
  • How variable is this metric over time for different merchants?
  • What time period is this metric most valuable for?

These explorations helped us understand the data that was available for us to surface and also validate or invalidate our proposed metrics. Below are some examples of how the shape of the data affected our product:

| What we saw | Action we took |
| --- | --- |
| We don’t have the data necessary to compute a metric, or a metric is always 0 for a merchant segment | Only show the metric to the stores where the metric is applicable |
| A metric stays constant over time | The metric isn’t a sensitive enough health indicator to show in the Orders page |
| A metric is most useful for longer time periods | The metric will only be available in reports where a merchant could look at those longer time periods |

 

These examples highlight that looking at the real data for different merchant segments was crucial for us to build an analytics experience that was useful, relevant, and fit the needs of all merchants.
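The three rules above can be sketched as a small decision function. This is a simplified illustration: the function name, the return labels, and the `needs_long_window` flag are assumptions, and in practice these calls came from analysis rather than an automated rule.

```python
def decide_placement(daily_values, needs_long_window=False):
    """Map the shape of one metric for one store to a product action.

    daily_values: the metric's value per day for a store (hypothetical input).
    needs_long_window: assumed flag for metrics that are only meaningful
    over longer time periods.
    """
    # No data, or always zero: the metric is not applicable to this store.
    if not daily_values or all(value == 0 for value in daily_values):
        return "hide_for_store"
    # Constant over time: not a sensitive enough health indicator.
    if len(set(daily_values)) == 1:
        return "omit_from_orders_page"
    # Useful only over longer periods: surface it in reports instead.
    if needs_long_window:
        return "reports_only"
    return "orders_page"


print(decide_placement([0, 0, 0]))
print(decide_placement([5, 5, 5]))
print(decide_placement([3, 7, 4], needs_long_window=True))
print(decide_placement([3, 7, 4]))
```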

Image: The data highlights template slide. The template has a title, the data question being answered, a description of the findings with space for graphical analysis, and the Data Science team's recommendation.

As you can imagine, these data explorations meant we were producing a lot of analyses. By the end of our project, we had collected hundreds of data points. To communicate these analyses clearly to our collaborators we produced a slide deck that contained one data point per slide, along with its implications and product recommendations. This enabled us to share our analyses in a consistent, digestible format, keeping everyone on our team informed.

Step 3: Prototyping the Experience with Real Data

Once we had explored the shape of the data, aligned on the metrics we wanted to surface to our merchants, and validated them, we worked with the UX team to ensure that they could prototype with data for the different merchant segments we outlined above.

Image: Example of prototyped data for the in-context analytics fulfillment, shipping, and delivery times report in the Shopify admin. The menu on the left has Reports highlighted, and the report shows a bar graph with its numerical values in a table below.
 

When we started exploring data for real stores, we found ourselves often re-thinking our visualizations or introducing complementary metrics to the ones we already identified. For example, we initially considered a report that displayed the median fulfillment, delivery, and in-transit times by location. When looking at the data and prototyping the report, we noticed that there was a spread in the event durations. We identified that a histogram visualization of the distribution of the event durations would be the most informative to merchants. As data scientists, we could prototype the graphs with real data and explore new visualizations that we provided to our product, engineering, and UX collaborators, influencing the final visualizations. 
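The sketch below shows why a single median can hide a spread that a histogram reveals. The durations are made-up sample values, and the 12-hour bucket width is an arbitrary choice for illustration.

```python
from collections import Counter
from statistics import median


def duration_histogram(hours, bucket_width=12):
    """Count durations per fixed-width bucket, keyed by a readable label."""
    counts = Counter(int(h // bucket_width) * bucket_width for h in hours)
    return {
        f"{start}-{start + bucket_width}h": counts[start]
        for start in sorted(counts)
    }


# Made-up fulfillment durations in hours for one store.
fulfillment_hours = [2, 3, 5, 20, 26, 30, 70, 75]

median_hours = median(fulfillment_hours)
histogram = duration_histogram(fulfillment_hours)

print(median_hours)
for bucket, count in histogram.items():
    print(f"{bucket:>8} {'#' * count}")
```

Here the median alone suggests a typical order takes about a day, while the histogram shows a cluster of same-day fulfillments and a tail of orders that take around three days.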

Step 4: Finalizing Our Findings and Metrics

Every metric in the experience (on the Orders page and reports) was powered by a different query, meaning that there were a lot of queries to keep track of. We wanted to make sure that all clients (Web, Android, and iOS) were using the same query with the same data source, fields, and aggregations. To make sure there was one source of truth for the queries powering the experience, the final step of this exploratory phase was to produce a data specification sheet.

This data specification sheet contained the following information for each metric in-context and in reports:

  • The merchant goal
  • The qualification logic (we only surface a metric to a merchant if relevant to them)
  • The query (this was useful to our developer team when we started building)

Image: Data specification sheet template example. A table with the columns Metric, Merchant goal, Qualification logic, and Query, and one row of sample data.

Giving our whole team access to the queries powering the experience meant that anyone could look at data for stores when prototyping and building.
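One way to picture a data specification sheet is as a single shared structure that every client reads from. The sketch below is only an illustration of that idea: the `MetricSpec` class, the `fulfilled_orders_30d` field, and the SQL text with its table and column names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class MetricSpec:
    """One row of a data specification sheet (illustrative structure)."""
    metric: str
    merchant_goal: str
    qualifies: Callable[[dict], bool]  # qualification logic for a store
    query: str                         # the single shared query for all clients


SPECS = {
    "median_fulfillment_time": MetricSpec(
        metric="median_fulfillment_time",
        merchant_goal="Reduce the time required for fulfillment",
        # Hypothetical rule: only stores with recent fulfillments qualify.
        qualifies=lambda store: store.get("fulfilled_orders_30d", 0) > 0,
        # Hypothetical query text; table and column names are made up.
        query="SELECT MEDIAN(fulfillment_hours) FROM fulfillment_events WHERE shop_id = :shop_id",
    ),
}


def query_for(metric: str, store: dict) -> Optional[str]:
    """Return the shared query if the store qualifies for the metric, else None."""
    spec = SPECS[metric]
    return spec.query if spec.qualifies(store) else None


print(query_for("median_fulfillment_time", {"fulfilled_orders_30d": 12}))
print(query_for("median_fulfillment_time", {"fulfilled_orders_30d": 0}))
```

Because Web, Android, and iOS would all look up the same entry, the data source, fields, and aggregations cannot drift apart between clients.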

Step 5: Building and Productionizing Our Models

Once we had identified the metrics we wanted to surface in-context, we worked on the dataset design for all the new metrics that weren’t already modelled. This process involved a few different steps:

  1. Identifying the business processes that our model would need to support (fulfilling an order, marking an order in-transit, marking an order as delivered)
  2. Selecting the grain of our data (we chose the atomic grain of one row per fulfillment, in-transit or delivery event, to ensure our model would be compatible with any future product or dimension we wanted to add)
  3. Determining the dimensions we would need to include

The last step was to finalize a schema for the model that would support the product’s needs.
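To illustrate what an atomic grain looks like, the sketch below models one row per fulfillment, in-transit, or delivery event and derives two aggregates from those rows. The `OrderEvent` fields are assumptions for illustration, not the real schema.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime

EVENT_TYPES = ("fulfilled", "in_transit", "delivered")


@dataclass(frozen=True)
class OrderEvent:
    """One row per event: the atomic grain (illustrative fields only)."""
    order_id: int
    event_type: str
    location: str
    occurred_at: datetime

    def __post_init__(self):
        # Reject rows that do not belong to a supported business process.
        if self.event_type not in EVENT_TYPES:
            raise ValueError(f"unknown event type: {self.event_type}")


events = [
    OrderEvent(1001, "fulfilled", "Toronto", datetime(2021, 5, 1, 15)),
    OrderEvent(1001, "in_transit", "Toronto", datetime(2021, 5, 2, 9)),
    OrderEvent(1001, "delivered", "Toronto", datetime(2021, 5, 4, 12)),
    OrderEvent(1002, "fulfilled", "Ottawa", datetime(2021, 5, 3, 10)),
]

# Any coarser view can be derived from the atomic rows.
events_by_type = Counter(event.event_type for event in events)
fulfilled_by_location = Counter(
    event.location for event in events if event.event_type == "fulfilled"
)

print(events_by_type)
print(fulfilled_by_location)
```

Keeping the rows atomic means a new dimension can be added as another field without changing how existing aggregates are computed.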

Beyond the dataset design, one of our requirements was that we wanted to surface real-time metrics, meaning we needed to build a streaming pipeline. However, for a few different reasons, we decided to start by building a batch pipeline.

First, modelling our data in batch meant we could produce a model in our usual development environment, iron out the details of the model, and iterate on any data cleaning. The model we created was available for internal analysis before sending it to our production analytics service, enabling us to easily run sanity checks. Engineers on our team were also able to use the batch model’s data as a placeholder when building.

Second, given our familiarity with building models in the batch environment, we were able to produce the batch model quickly. This gave us the ability to iterate on it behind the scenes and gave the engineers on our team the ability to start querying the model, and using the data as a placeholder when building the experience.

Third, the batch model allowed us to backfill the historic data for this model, and eventually use the streaming pipeline to power the most recent data that wasn’t included in our model. Using a lambda architecture approach where historical data came from the batch data model, while the streaming model powered the most recent data not yet captured in batch, helped limit any undue pressure on our streaming infrastructure.
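A minimal sketch of that serving logic, assuming each row carries an `occurred_at` timestamp and that the batch model is complete up to a known cutoff. The function name, row shape, and cutoff handling are illustrative assumptions.

```python
from datetime import datetime


def serve_events(batch_rows, streaming_rows, batch_cutoff):
    """Combine historical batch rows with streaming rows newer than the batch cutoff."""
    # Batch is the source of truth up to and including the cutoff.
    historical = [row for row in batch_rows if row["occurred_at"] <= batch_cutoff]
    # Streaming only fills the gap the batch model has not covered yet.
    recent = [row for row in streaming_rows if row["occurred_at"] > batch_cutoff]
    return historical + recent


batch_cutoff = datetime(2021, 5, 10)

batch_rows = [
    {"id": 1, "occurred_at": datetime(2021, 5, 8)},
    {"id": 2, "occurred_at": datetime(2021, 5, 9)},
]
streaming_rows = [
    {"id": 2, "occurred_at": datetime(2021, 5, 9)},       # already in batch
    {"id": 3, "occurred_at": datetime(2021, 5, 10, 12)},  # newer than batch
    {"id": 4, "occurred_at": datetime(2021, 5, 11)},
]

served_ids = [row["id"] for row in serve_events(batch_rows, streaming_rows, batch_cutoff)]
print(served_ids)
```

Splitting on the cutoff keeps overlapping rows from being counted twice and means the streaming side only has to hold the most recent data.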

Step 6: Measuring Success

At Shopify, our primary success metrics always come back to: what is the merchant problem we’re solving? We use a framework that helps us define what we intend to answer and consider any potential factors that could influence our findings. The framework goes over the following questions:

  1. What is the problem?
  2. What is our hypothesis or expected outcome for resolving the problem?
  3. What signals can we use to determine if we are successful?
  4. What factors could contribute to seeing this signal go up or down, and which are good or bad?
  5. What is our baseline or goal?
  6. What additional context or segments should we focus on?

Here’s an example for this project using the above framework:

  1. Problem: as a merchant, I don’t want to have to dig for the data I need to inform decisions. I want data to be available right within my workflow.
  2. Hypothesis: providing in-context analytics within a merchant’s workflow will enable them to make informed decisions, and not require them to move between pages to find relevant data.
  3. Signals: merchants are using this data while completing operational tasks on the Orders page, and we see a decrease in their transitioning between the Orders page and Analytics section.
  4. Considerations: the signal and adoption may be lower than expected due to this being a new pattern. This may not necessarily be because the data wasn’t valuable to the merchant, but simply because merchants aren’t discovering it.
  5. Baseline: merchants transitioning between the Orders Page and Analytics section prior to release compared to post-release.
  6. Context: explore usage by segments like merchant business size. For example, the larger a business, the more likely they are to hire staff to fulfill orders, which may mean they aren’t interested in analyzing the data.
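The signal and baseline in this example depend on counting how often a merchant moves between the Orders page and the Analytics section. Here is a rough sketch of that count, assuming page views arrive as an ordered list of page names per session; the page names and function are hypothetical.

```python
def count_transitions(page_views, page_a="orders", page_b="analytics"):
    """Count direct moves between two pages, in either direction."""
    consecutive_pairs = zip(page_views, page_views[1:])
    return sum(1 for prev, nxt in consecutive_pairs if {prev, nxt} == {page_a, page_b})


# Hypothetical sessions before and after the release.
session_before = ["orders", "analytics", "orders", "products", "orders"]
session_after = ["orders", "products", "orders"]

print(count_transitions(session_before))
print(count_transitions(session_after))
```

Comparing this count before and after release, per segment, gives the baseline comparison described in point 5.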

This is just one example, and with each point in the framework there is a long list of things to consider. It’s also important to note that for this project there are other audiences who have questions of their own. For instance, our data engineers have goals around the performance of this new data model. Because this project has multiple goals and audiences, combining all of these success metrics into one dashboard would be chaotic. That’s why we decided to create a dashboard for each goal and audience, documenting the key questions each dashboard would answer. If you’re interested in how we approach making dashboards at Shopify, check out our blog!

As for understanding how this feature impacts users, we’re still working on that last step to ensure there are no unintended negative impacts.

The Results

From the ideation and validation of the metrics, to prototyping and building the data models, to measuring success, data science was truly involved end-to-end for this project. With our new in-context analytics experience, merchants can see the health of their orders and fulfillment workflows right within their Orders page. More specifically, we surface in-context data to merchants about their:

  • Total orders (overall and over time)
  • Number of ordered items
  • Number of returned items
  • Fulfilled orders (overall and over time)
  • Delivered orders
  • Median fulfillment time

These metrics capture data for the day, the last seven days, and last thirty days. For every time period, merchants can see a green (positive change) or red (negative change) colored indicator informing them of the metric’s health compared to a previous comparison period.
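The indicator logic can be sketched as a comparison against the previous period. Two details here are assumptions rather than a description of the product: the `higher_is_better` flag (for a metric like median fulfillment time, a decrease is the positive change) and the `neutral` result when nothing changed.

```python
def health_indicator(current, previous, higher_is_better=True):
    """Return 'green' for a positive change, 'red' for a negative one."""
    if current == previous:
        return "neutral"  # assumption: no colour when the value is unchanged
    improved = (current > previous) == higher_is_better
    return "green" if improved else "red"


# Total orders went up compared to the previous period.
print(health_indicator(current=120, previous=100))
# Median fulfillment time (hours) went up, which is a negative change.
print(health_indicator(current=30, previous=24, higher_is_better=False))
```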

Image: The in-context analytics experience on the Orders page in the Shopify admin. The menu on the left has Orders highlighted. A dashboard at the top shows the last 30 days of aggregate order data, and below it a table shows Order number, Date, Customer, Total, Payment status, Fulfillment status, Item, Delivery method, and Tags.

We also gave merchants the functionality to click on a metric to view reports that give them a more in-depth view of the data:

  • The Orders Over Time Report: Displays the total number of orders that were received over the selected time period. It includes total orders, average units (products) per transaction, average order value, and returned items.
  • Product Orders and Returns Report: Helps merchants understand which products are their best sellers and which get returned the most often.
  • Fulfillment, Shipping, and Delivery Times Report: Shows how quickly orders move through the entire fulfillment process, from order receipt to delivery to the customer.
  • Fulfillments Over Time Report: Showcases the total number of orders that were either fulfilled, shipped, or delivered over the selected time period.

Image: In-context analytics fulfillments over time report in the Shopify admin. The menu on the left has Reports highlighted. The report shows a line graph of the number of fulfilled orders (y-axis) over time (x-axis), and below it a table with the date and the number of fulfilled, shipped, and delivered orders.

What We Learned

There were a lot of key takeaways from this project that we plan to implement in future analytics projects, including:

  1. Collaborating with different disciplines and creating central documents as a source of truth. Working with and communicating effectively with various teams was an essential part of enabling data-informed decision making. Creating documents like the data highlights slide deck and data specification sheet ensured our full team was kept up-to-date.

  2. Exploring and prototyping our experience with real, segmented data. We can’t stress this enough: our merchants come in all shapes and sizes, so it was critical for us to look at various segments and prototype with real data to ensure we were creating the best experience for all our merchants.

  3. Prototyping models in the batch environment before making them streaming. This was effective in derisking the modelling efforts and unblocking engineers.

So, what’s next? We plan to continue putting data into the hands of our merchants, when and where they need it most. We aim to make data accessible to merchants in more surfaces that involve their day-to-day workflows beyond their Orders page.

If you’re interested in building analytic experiences that help entrepreneurs around the world make more data-informed decisions, we’re looking for talented data scientists and data engineers to join our team.

Federica Luraschi:  Federica is a data scientist on the Insights team. In her last 3+ years at Shopify, she has worked on building and surfacing insights and analytics experiences to merchants. If you’d like to connect with Federica, reach out here.

Racheal Herlihy: Racheal has been a Data Scientist with Shopify for nearly 3 years. She works on the Insights team, whose focus is on empowering merchants to make informed business decisions. Previously, Racheal helped co-found a social enterprise helping protect New Zealand’s native birds through sensor-connected pest traps. If you’d like to get in touch with Racheal, you can reach out on LinkedIn.


Other Driven Developments

Mental models within an industry, company, or even a person, change constantly. As methodologies mature, we see the long term effects our choices have wrought and can adjust accordingly. As a team or company grows, methodologies that worked well for five people may not work as well for 40 people. If all employees could keep an entire app in their head, we’d need fewer rules and checks and balances on our development, but that is not the case. As a result, we summarize things we notice have been implicit in our work.


Three Ways We Share Context at Shopify Engineering

To do your best work as a developer, you need context. A development community thrives when its members share context as a habit. That's why Shopify Engineering believes that sharing context is vital to our growth and success—as a company, as a team, and as individuals.

Context is an abstract term, and we use it to refer to the why, what, and how of our development philosophies, approaches, and choices. Although sharing context comes in a myriad of forms, three of the ways we do it at Shopify are through:

  • Our in-house Development Handbook
  • A vibrant developer presentation program, called Dev Talks
  • Podcasts that go deep on technical subjects.

Each of these programs is by and for developers at Shopify, and each has strong support from leadership. Let's take a brief look at each of these ways that we share context and how they benefit our development community.

Shopify Development Handbook

Managing content is always a challenge. In high-tech organizations, where tools and technologies change frequently, products ship quickly, and projects can pivot from day to day, having access to up-to-date and relevant information is critical. In a whirlwind of change and information volatility, we try to keep a central core of context in our Development Handbook.

Origins

The origins of the Handbook go back to 2015, when Jean-Michel Lemieux (JML), our current Chief Technology Officer, joined Shopify. In his first role at the company, he needed to learn about the Shopify platform—quickly. The company and the product were both growing rapidly, and the documentation that existed was scattered. Much of the knowledge and expertise was only in the heads of Shopifolk, not written down anywhere.

So he started a simple Google Doc to try to get a grip on Shopify both technically and architecturally. As the doc got longer, it was clear that the content was valuable not just to him—this was a resource that every developer could contribute to and benefit from.

Soon, the doc took shape. It contained a history of Shopify development, an overview of the architecture, our development philosophy, and the technology stack we had chosen. It also included information about our production environment and guidance on how we handled incidents.

Becoming a Website

The next logical step was to evolve that document into an internal website that could be more easily searched and navigated, better maintained, and more visible across the whole engineering organization. A small team was formed to build the site, and in 2018 the Development Handbook site went live.

Since then, developers from all disciplines across Shopify have contributed hundreds of topics to the Handbook. Now, it also contains information about our development cultures and practices, using our key technologies, our principles of development, and a wealth of detailed content on how we deploy, manage, and monitor our code.

The process for adding content is developer-friendly, using the same GitHub processes of pull requests and reviews that developers use while coding. We use Markdown for easy content entry and formatting. The site runs on Middleman, and developers contribute to operations of the site itself, like a recent design refresh (including dark mode), improvements to search, and even adding client-side rendering of KaTeX equations.

Image: Example topic from the internal Shopify Development Handbook. The top of the page contains the search functionality and a hamburger menu, with a breadcrumb menu below. The page is titled Data Stores, with the copy shown below the title.

Handling Oversight and Challenges

The Handbook is essentially crowd-sourced, but it's also overseen, edited, and curated by the small but mighty Developer Context team. This team defines the site's information architecture, evangelizes the Handbook, and works with developers to assist them as they contribute and update content.

The Development Handbook allows our team to push knowledge about Sorbet, static-typing in Ruby, and our best practices to the rest of the company. It's a necessary resource for all our newcomers and a good one even for our experts. A must read.

Alexandre Terrasa, Staff Production Engineer

Despite its success as a central repository for technical content and context in Shopify’s Engineering org, the Handbook always faces challenges. Some content is better off residing in other locations, like GitHub, where it's closest to the code. Some content that has a limited audience might be better off in a standalone site. There are constant and opposing pressures to either add content to the Handbook or to move content from the Handbook to elsewhere.

Keeping content fresh and up-to-date is also a never-ending job. To try to ensure content isn't created and then forgotten about, we have an ownership model that asks teams to be responsible for any topics that are closely related to their mandates and projects. However, this isn't sufficient, as engineering teams are prone to being reorganized and refocused on new areas.

We haven't found the sweet spot yet for addressing these governance challenges. However, sharing is in the DNA of Shopify developers, and we have a great community of contributors who update content proactively, while others update anything they encounter that needs to change. We're exploring a more comprehensive proactive approach to content maintenance, but we're not sure what final form that will take yet.

In all likelihood, there won't be a one-time fix. Just like the rest of the company, we'll adapt and change as needed to ensure that the Handbook continues to help Shopify developers get up to speed quickly and understand how and why we do the things we do.

Dev Talks

Similar to the Development Handbook, the Dev Talks program has a long history at Shopify. It started in 2014 as a weekly open stage to present demos, prototypes, experiments, technical findings, or any other idea that might resonate with fellow developers.

Although the team managing this program has changed over the years, and its name has gone through several iterations, the primary goals remain the same: it's a place for developers to share their leading-edge work, technology explorations, wins, or occasional failures, with their peers. The side benefits are the opportunity for developers to build their presentation skills and to be recognized and celebrated for their work. The Developer Context team took over responsibility for the program in 2018.

Before Shopify shifted to a fully remote work model, talks were usually presented in our large cafeteria spaces, which lent an informal and casual atmosphere to the proceedings, and where big screens were used to show off one's work. Most presenters used slides as a way of organizing their thoughts, but many talks were off the cuff, or purely demos.

A picture of Shopify employees gathering in the lunch room of the Ottawa Office for Dev Talks.  There are 5 long tables with several people sitting down. The tables are placed in front of a stage with a screen. On that screen is an image of Crash Bandicoot. There is a person standing at a lectern on the stage presenting.
Developers gather for a Dev Talk in the former Shopify headquarters in Ottawa, Canada

With the company's shift to Digital by Design in 2020, we had to change the model and decided on an on-demand approach. Now, developers record their presentations so they can be watched at any time by their colleagues. Talks don't have prescribed time limits, but most tend to be about 15 to 20 minutes long.

To ensure we get eyes on the talks, the Developer Context team promotes the presentations via email and Slack. The team set up a process to simplify signing up to do a talk and get it promoted, and created branding like a Dev Talks logo and presentation template. Dev Talks is a channel on the internal Shopify TV platform, which helps ensure the talks have a consistent home and are easy to find.

Dev Talks have given me the opportunity to share my excitement about the projects I've worked on with other developers. They have helped me develop confidence and improve my public speaking skills while also enabling me to build my personal brand at Shopify.

Adrianna Chang, Developer

Our on-demand model is still very new, so we'll monitor how it goes and determine if we need to change things up again. What's certain is that the desire for developers to share their technical context is strong, and the appetite to learn from colleagues is equally solid.

Internal Podcasts

The third way we share context is through podcasts. Podcasts aren't new, but they have surged in popularity in recent years. In fact, the founder and CEO of Shopify, Tobi Lütke, has hosted his own internal podcast, called Context, since 2017. This podcast has a wide thematic scope. Most episodes, as we might expect from our CEO, have a technology spin, but they’re geared for a Shopify-wide audience.

To provide an outlet for technical conversations focused squarely on our developer community, the Technical Leadership Team (TLT)—a group of senior developers who help to ensure that Shopify makes great technical decisions—recently launched their own internal podcast, Shift. The goal of these in-depth conversations is to unpack technical decisions and dig deep into the context around them.

The Shift podcast is where we talk about ideas that are worth reinforcing in Shopify Engineering. All systems degrade over time, so this forum lets us ensure that the best parts are properly oiled.

Alex Topalov, Senior Development Manager

About once a month, several members of the TLT sit down virtually with a senior engineering leadership member to probe for specifics around technologies or technical concepts. And the leaders who are interviewed are in the hot seat for a while—these recorded podcasts can last up to an hour. Recent episodes have focused on conversations about machine learning and artificial intelligence at Shopify, the resiliency of our systems, and how we approach extensibility.

To ensure everyone can take advantage of the podcasts, they're made available in both audio and video formats, and a full transcript is provided. Because they’re lengthy deep dives, developers can listen to partial segments at the time of their choosing. The on-demand nature of these podcasts is valuable, and the data shows that uptake on them is strong.

We'll continue to measure the appetite for this type of detailed podcast format to make sure it's resonating with the developer community over the long run.

Context Matters

The three approaches covered here are just a sample of how we share context at Shopify Engineering. We use many other techniques including traditional sharing of information through email newsletters, Slack announcement channels, organization town halls with ask me anything (AMA) segments, video messages from executives, and team demos—not to mention good old-fashioned meetings.

We expect that, post-pandemic, we’ll reintroduce in-person gatherings where technical teams can come together for brief but intense periods of context sharing hand-in-hand with prototyping, team-building, and deep development explorations.

Our programs are always iterating and evolving to target what works best, and new ideas spring up regularly to complement our more formal programs like the Development Handbook, Dev Talks, and the Shift podcast. What matters most is that an environment is in place to promote, recognize, and celebrate the benefits of sharing knowledge and expertise, with solid buy-in from leadership.

Christopher writes and edits internal content for developers at Shopify and is on the Developer Context team. He’s a certified copy editor with expertise in content development and technical editing. He enjoys playing the piano and has recently been exploring works by Debussy and Rachmaninov.


We're planning to DOUBLE our engineering team in 2021 by hiring 2,021 new technical roles (see what we did there?). Our platform handled record-breaking sales over BFCM and commerce isn't slowing down. Help us scale & make commerce better for everyone.

How I Define My Boundaries to Prevent Burnout

There’s a way to build a model for your life and the other demands on you. It doesn’t have to be 100 hours a week. It might not even be 40 hours a week. Whatever sustainable model you come up with, prioritize fiercely and communicate expectations accordingly. I’ve worked this way for three decades through different phases of my life. The hours have changed, but the basic model and principles remain the same: Define the time, prioritize the work, and communicate those two effectively.

A Five-Step Guide for Conducting Exploratory Data Analysis

Have you ever been handed a dataset and then been asked to describe it? When I was first starting out in data science, this question confused me. My first thought was “What do you mean?” followed by “Can you be more specific?” The reality is that exploratory data analysis (EDA) is a critical tool in every data scientist’s kit, and the results are invaluable for answering important business questions.

Simply put, an EDA refers to performing visualizations and identifying significant patterns, such as correlated features, missing data, and outliers. EDAs are also essential for providing hypotheses for why these patterns occur. An EDA most likely won’t appear in your data product, data highlight, or dashboard, but it will help to inform all of these things.

Below, I’ll walk you through key tips for performing an effective EDA. For the more seasoned data scientists, the contents of this post may not come as a surprise (rather a good reminder!), but for the new data scientists, I’ll provide a potential framework that can help you get started on your journey. We’ll use a synthetic dataset for illustrative purposes. The data below does not reflect actual data from Shopify merchants. As we go step-by-step, I encourage you to take notes as you progress through each section!

Before You Start

Before you start exploring the data, you should try to understand the data at a high level. Speak to leadership and product to try to gain as much context as possible to help inform where to focus your efforts. Are you interested in performing a prediction task? Is the task purely for exploratory purposes? Depending on the intended outcome, you might point out very different things in your EDA.

With that context, it’s now time to look at your dataset. It’s important to identify how many samples (rows) and how many features (columns) are in your dataset. The size of your data helps inform any computational bottlenecks that may occur down the road. For instance, computing a correlation matrix on large datasets can take quite a bit of time. If your dataset is too big to work within a Jupyter notebook, I suggest subsampling so you have something that represents your data, but isn’t too big to work with.

The first 5 rows of our synthetic dataset. The dataset above does not reflect actual data from Shopify merchants.

Once you have your data in a suitable working environment, it’s usually a good idea to look at the first couple of rows. The above image shows an example dataset we can use for our EDA. This dataset is used to analyze merchant behaviour. Here are a few details about the features:

  • Shop Cohort: the month and year a merchant joined Shopify.
  • GMV (Gross Merchandise Volume): total value of merchandise sold.
  • AOV (Average Order Value): the average value of customers' orders since their first order.
  • Conversion Rate: the percentage of sessions that resulted in a purchase.
  • Sessions: the total number of sessions on your online store.
  • Fulfilled Orders: the number of orders that have been packaged and shipped.
  • Delivered Orders: the number of orders that have been received by the customer.

One question to address is “What is the unique identifier of each row in the data?” A unique identifier can be a column or set of columns that is guaranteed to be unique across rows in your dataset. This is key for distinguishing rows and referencing them in our EDA.

Now, if you’ve been taking notes, here’s what they may look like so far:

  • The dataset is about merchant behaviour. It consists of historic information about a set of merchants collected on a daily basis
  • The dataset contains 1500 samples and 13 features. This is a reasonable size and will allow me to work in Jupyter notebooks
  • Each row of my data is uniquely identified by the “Snapshot Date” and “Shop ID” columns. In other words, my data contains one row per shop per day
  • There are 100 unique Shop IDs and 15 unique Snapshot Dates
  • Snapshot Dates range from ‘2020-01-01’ to ‘2020-01-15’ for a total of 15 days
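
The size and unique-identifier checks in these notes can be sketched in a few lines. This is a minimal illustration in plain Python, assuming the data is loaded as a list of dictionaries; the column names `snapshot_date`, `shop_id`, and `gmv`, and the helper `is_unique_key`, are hypothetical and not part of the original dataset.

```python
from collections import Counter

# Hypothetical rows: one dict per shop per day (column names are assumptions).
rows = [
    {"snapshot_date": "2020-01-01", "shop_id": 1, "gmv": 62894.85},
    {"snapshot_date": "2020-01-01", "shop_id": 2, "gmv": None},
    {"snapshot_date": "2020-01-02", "shop_id": 1, "gmv": 25890.06},
    {"snapshot_date": "2020-01-02", "shop_id": 2, "gmv": 6446.36},
]


def is_unique_key(rows, key_columns):
    """Return True if the combination of key_columns never repeats across rows."""
    counts = Counter(tuple(row[col] for col in key_columns) for row in rows)
    return all(count == 1 for count in counts.values())


n_samples = len(rows)
n_features = len(rows[0])
print(f"{n_samples} samples, {n_features} features")
print("snapshot_date alone is unique:", is_unique_key(rows, ["snapshot_date"]))
print("(snapshot_date, shop_id) is unique:", is_unique_key(rows, ["snapshot_date", "shop_id"]))
```

If no single column is unique, trying pairs of columns like this is a quick way to confirm the "one row per shop per day" assumption before relying on it.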

1. Check For Missing Data

Now that we’ve decided how we’re going to work with the data, we begin to look at the data itself. Checking your data for missing values is usually a good place to start. For this analysis, and future analysis, I suggest analyzing features one at a time and ranking them with respect to your specific analysis. For example, if we look at the below missing values analysis, we’d simply count the number of missing values for each feature, and then rank the features from the largest number of missing values to the smallest. This is especially useful if there are a large number of features.

Feature ranking by missing value counts
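
A ranking like the one in the figure can be produced with a short helper. The sketch below uses plain Python and assumes missing values are represented as `None`; the sample rows and the function name `rank_missing` are made up for illustration.

```python
# Hypothetical sample; None marks a missing value.
rows = [
    {"gmv": 62894.85, "aov": 628.95, "sessions": 11},
    {"gmv": None, "aov": 1390.86, "sessions": 119},
    {"gmv": 25890.06, "aov": None, "sessions": 182},
    {"gmv": None, "aov": 64.46, "sessions": 147},
]


def rank_missing(rows):
    """Return (feature, missing_count, missing_fraction), most missing first."""
    ranking = []
    for feature in rows[0].keys():
        missing = sum(1 for row in rows if row[feature] is None)
        ranking.append((feature, missing, missing / len(rows)))
    return sorted(ranking, key=lambda item: item[1], reverse=True)


for feature, missing, fraction in rank_missing(rows):
    print(f"{feature}: {missing} missing ({fraction:.0%})")
```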

Let’s look at an example of something that might occur in your data. Suppose a feature has 70 percent of its values missing. With such a high amount of missing data, some may suggest simply removing this feature from the data entirely. However, before we do anything, we try to understand what this feature represents and why we’re seeing this behaviour. After further analysis, we may discover that this feature represents a response to a survey question, and in most cases it was left blank. A possible hypothesis is that a large proportion of the population didn’t feel comfortable providing an answer. If we simply remove this feature, we introduce bias into our data. Therefore, this missing data is a feature in its own right and should be treated as such.

Now for each feature, I suggest trying to understand why the data is missing and what it can mean. Unfortunately, this isn’t always so simple and an answer might not exist. That’s why an entire area of statistics, Imputation, is devoted to this problem and offers several solutions. What approach you choose depends entirely on the type of data. For time series data without seasonality or trend, you can replace missing values with the mean or median. If the time series does contain a trend but not seasonality, then you can apply a linear interpolation. If it contains both, then you should adjust for the seasonality and then apply a linear interpolation. In the survey example I discussed above, I handle missing values by creating a new category “Not Answered” for the survey question feature. I won’t go into detail about all the various methods here; however, I suggest reading How to Handle Missing Data for more details on Imputation.
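
To make three of these options concrete, here is a minimal sketch in plain Python: a median fill, a linear interpolation, and a "Not Answered" category. The function names and sample values are hypothetical, missing values are assumed to be `None`, and the seasonal adjustment step is not covered.

```python
from statistics import median


def fill_with_median(values):
    """Replace None with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def interpolate_linear(values):
    """Fill interior runs of None on a straight line between their neighbours.

    Leading or trailing gaps stay None because there is nothing to
    interpolate between.
    """
    filled = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = values[left] + step * (i - left)
    return filled


def fill_category(values, label="Not Answered"):
    """Treat a missing categorical answer as its own category."""
    return [label if v is None else v for v in values]


print(fill_with_median([10, None, 14, 12]))
print(interpolate_linear([10.0, None, None, 16.0, None]))
print(fill_category(["Yes", None, "No"]))
```

Leaving the trailing gap unfilled is a deliberate choice in this sketch: extrapolating past the last observation is a modelling decision rather than an imputation default.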

Great! We’ve now identified the missing values in our data—let’s update our summary notes:

  • ...
  • 10 features contain missing values
  • “Fulfilled Orders” contains the most missing values at 8% and “Shop Currency” contains the least at 6%

2. Provide Basic Descriptions of Your Sample and Features

At this point in our EDA, we’ve identified features with missing values, but we still know very little about our data. So let’s try to fill in some of the blanks. Let’s categorize our features as either:

Continuous: A feature that is continuous can assume an infinite number of values in a given range. An example of a continuous feature is a merchant’s Gross Merchandise Volume (GMV).

Discrete: A feature that is discrete can assume a countable number of values and is always numeric. An example of a discrete feature is a merchant’s Sessions.

Categorical: A feature that is categorical can only assume a finite number of values. An example of a categorical feature is a merchant’s Shopify plan type.

The goal is to classify all your features into one of these three categories.

GMV         AOV        Conversion Rate
62894.85    628.95     0.17
NaN         1390.86    0.07
25890.06    258.90     0.04
6446.36     64.46      0.20
47432.44    258.90     0.10

 Example of continuous features

Sessions    Products Sold    Fulfilled Orders
11          52               108
119         46               96
182         47               98
147         44               99
45          65               125

 Example of discrete features

Plan                Country    Shop Cohort
Plus                UK         NaN
Advanced Shopify    UK         2019-04
Plus                Canada     2019-04
Advanced Shopify    Canada     2019-06
Advanced Shopify    UK         2019-06

 Example of categorical features

You might be asking yourself, how does classifying features help us? This categorization helps us decide what visualizations to choose in our EDA, and what statistical methods we can apply to this data. Some visualizations won’t work on all continuous, discrete, and categorical features. This means we have to treat groups of each type of feature differently. We will see how this works in later sections.

Let’s focus on continuous features first. Record any characteristics that you think are important, such as the maximum and minimum values for that feature. Do the same thing for discrete features. For categorical features, some things I like to check for are the number of unique values and the number of occurrences of each unique value. Let’s add our findings to our summary notes:

  • ...
  • There are 3 continuous features, 4 discrete features, and 4 categorical features
  • GMV:
    • Continuous Feature
    • Values are between $12.07 and $814468.03
    • Data is missing for one day…I should check to make sure this isn’t a data collection error
  • Plan:
    • Categorical Feature
    • Assumes four values: “Basic Shopify”, “Shopify”, “Advanced Shopify”, and “Plus”
    • The value counts of “Basic Shopify”, “Shopify”, “Advanced Shopify”, and “Plus” are 255, 420, 450, and 375, respectively
    • There seem to be more merchants on the “Advanced Shopify” plan than on the “Basic Shopify” plan. Does this make sense?
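
Notes like these come from a handful of simple summaries. As a rough sketch in plain Python (the sample values and helper names are made up for illustration, and `None` stands for a missing value), the minimum, maximum, and value counts can be collected like this:

```python
from collections import Counter

# Hypothetical values; None marks a missing value.
gmv = [62894.85, None, 25890.06, 6446.36, 47432.44]
plan = ["Plus", "Advanced Shopify", "Plus", "Advanced Shopify", "Advanced Shopify"]


def describe_numeric(values):
    """Summarize a continuous or discrete feature."""
    observed = [v for v in values if v is not None]
    return {
        "min": min(observed),
        "max": max(observed),
        "missing": len(values) - len(observed),
    }


def describe_categorical(values):
    """Summarize a categorical feature by its unique values and their counts."""
    counts = Counter(v for v in values if v is not None)
    return {"unique": len(counts), "counts": counts.most_common()}


print(describe_numeric(gmv))
print(describe_categorical(plan))
```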

3. Identify The Shape of Your Data

The shape of a feature can tell you so much about it. What do I mean by shape? I’m referring to the distribution of your data, and how it can change over time. Let’s plot a few features from our dataset:

GMV and Sessions behaviour across samples

If the dataset is a time series, then we investigate how the feature changes over time. Perhaps there’s a seasonality to the feature or a positive/negative linear trend over time. These are all important things to consider in your EDA. In the graphs above, we can see that AOV and Sessions have positive linear trends and Sessions exhibits a seasonality (a distinct behaviour that occurs in intervals). Recall that Snapshot Date and Shop ID uniquely define our data, so the seasonality we observe can be due to particular shops having more sessions than other shops in the data. In the line graph below, we see that the Sessions seasonality was a result of two specific shops: Shop 1 and Shop 51. Perhaps these shops have a higher GMV or AOV?

Line graph of Sessions (y-axis) against Snapshot Date (x-axis), showing that the Sessions seasonality was a result of two specific shops: Shop 1 and Shop 51

Next, you’ll calculate the mean and variance of each feature. Does the feature hardly change at all? Is it constantly changing? Try to hypothesize about the behaviour you see. A feature that has a very low or very high variance may require additional investigation.

Probability Density Functions (PDFs) and Probability Mass Functions (PMFs) are your friends. To understand the shape of your features, PMFs are used for discrete features and PDFs for continuous features.
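
As a rough illustration of both ideas, the sketch below builds an empirical PMF for a discrete feature and a histogram-based density estimate for a continuous one. It uses plain Python; the function names are hypothetical, and the equal-width histogram is only one of several ways to approximate a PDF.

```python
from collections import Counter


def empirical_pmf(values):
    """Map each observed value to its relative frequency."""
    counts = Counter(values)
    total = len(values)
    return {value: count / total for value, count in sorted(counts.items())}


def histogram_density(values, bins):
    """Estimate a density with equal-width bins so that the bar areas sum to 1.

    Assumes the values are not all identical (the bin width would be zero).
    Returns a list of (bin_left_edge, density) pairs.
    """
    low, high = min(values), max(values)
    width = (high - low) / bins
    counts = [0] * bins
    for v in values:
        index = min(int((v - low) / width), bins - 1)  # the maximum goes in the last bin
        counts[index] += 1
    return [(low + i * width, count / (len(values) * width)) for i, count in enumerate(counts)]


print(empirical_pmf([1, 1, 2, 3]))
print(histogram_density([0.0, 1.0, 2.0, 4.0], bins=2))
```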

Example of feature density functions

Here are a few things that PMFs and PDFs can tell you about your data:

  • Skewness
  • Is the feature heterogeneous (multimodal)?
  • If the PDF has a gap in it, the feature may be disconnected.
  • Is it bounded?

We can see from the example feature density functions above that all three features are skewed. Skewness measures the asymmetry of your data. This might deter us from using the mean as a measure of central tendency. The median is more robust, but it comes with additional computational cost.
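
To see why skewness matters for the choice of central tendency, the sketch below compares the mean and median on a small right-skewed sample. The skewness formula used is the population version (the average cubed z-score), which is one common definition among several; the sample reuses the five AOV values from the example table purely for illustration.

```python
from statistics import mean, median, pstdev


def skewness(values):
    """Population skewness: the mean cubed z-score (0 for symmetric data)."""
    mu = mean(values)
    sigma = pstdev(values)
    return sum(((v - mu) / sigma) ** 3 for v in values) / len(values)


# AOV values from the example table, used here only as a tiny illustrative sample.
aov = [628.95, 1390.86, 258.90, 64.46, 258.90]

print(f"skewness = {skewness(aov):.2f}")
print(f"mean = {mean(aov):.2f}, median = {median(aov):.2f}")
```

A positive skewness with the mean well above the median suggests that a few large values are pulling the mean upward.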

Overall, there are a lot of things you can consider when visualizing the distribution of your data. For more great ideas, I recommend reading Exploratory Data Analysis. Don’t forget to update your summary notes:

  • ...
  • AOV:
    • Continuous Feature
    • Values are between $38.69 and $8994.58
    • 8% of values missing
    • Observe a large skewness in the data.
    • Observe a positive linear trend across samples
  • Sessions:
    • Discrete Feature
    • Contains count data (can assume non-negative integer data)
    • Values are between 2 and 2257
    • 7.7% of values missing
    • Observe a large skewness in the data
    • Observe a positive linear trend across samples
    • Shops 1 and 51 have larger daily sessions and show a more rapid growth of sessions compared to other shops in the data
  • Conversion Rate:
    • Continuous Feature
    • Values are between 0.0 and 0.86
    • 7.7% of values missing
    • Observe a large skewness in the data

4. Identify Significant Correlations

Correlation measures the relationship between two variable quantities. Suppose we focus on the correlation between two discrete features: Delivered Orders and Fulfilled Orders. The easiest way to visualize correlation is by plotting a scatter plot with Delivered Orders on the y-axis and Fulfilled Orders on the x-axis. As expected, there’s a positive relationship between these two features.

Scatter plot showing positive correlation between features “Delivered Orders” and “Fulfilled Orders”

If you have a high number of features in your dataset, then you can’t create this plot for all of them—it takes too long. So, I recommend computing the Pearson correlation matrix for your dataset. It measures the linear correlation between features in your dataset and assigns a value between -1 and 1 to each feature pair. A positive value indicates a positive relationship and a negative value indicates a negative relationship.

Correlation matrix for continuous and discrete features
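
For readers who want to see what goes into such a matrix, here is a minimal sketch of the Pearson correlation in plain Python. The helper names are hypothetical, pairs with a missing value (`None`) are skipped, and the `delivered_orders` column is made up; the other two columns reuse values from the discrete example table.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences, skipping pairs with a missing value."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    var_x = sum((a - mean_x) ** 2 for a, _ in pairs)
    var_y = sum((b - mean_y) ** 2 for _, b in pairs)
    return cov / sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Return a nested dict with the Pearson correlation for every feature pair."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


columns = {
    "fulfilled_orders": [108, 96, 98, 99, 125],
    "delivered_orders": [100, 90, 95, 93, 120],  # hypothetical values
    "sessions": [11, 119, 182, 147, 45],
}

matrix = correlation_matrix(columns)
for name, row in matrix.items():
    print(name, {other: round(value, 2) for other, value in row.items()})
```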

It’s important to take note of all significant correlations between features. You might observe many relationships between features in your dataset, or you might observe very few. Every dataset is different! Try to form hypotheses around why features might be correlated with each other.

In the correlation matrix above, we see a few interesting things. First of all, we see a large positive correlation between Fulfilled Orders and Delivered Orders, and between GMV and AOV. We also see a slight positive correlation between Sessions, GMV, and AOV. These are significant patterns that you should take note of.

Since our data is a time series, we also look at the autocorrelation of shops. Computing the autocorrelation reveals relationships between a signal’s current value and its previous values. For instance, there could be a positive correlation between a shop’s GMV today and its GMV from 7 days ago, which would point to a weekly pattern such as customers shopping more on Saturdays than on Mondays. I won’t go into any more detail here since Time Series Analysis is a very large area of statistics, but I suggest reading A Very Short Course on Time Series Analysis.
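
One simple way to check for a weekly pattern is to correlate a series with a copy of itself shifted by 7 days. The sketch below does this in plain Python on a made-up GMV series for a single shop; the function name and the data are hypothetical, and this lagged Pearson correlation is only one of several autocorrelation estimators.

```python
def autocorrelation(series, lag):
    """Pearson correlation between the series and itself shifted by `lag` steps (lag >= 1)."""
    x = series[:-lag]
    y = series[lag:]
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical daily GMV for one shop, with a bump on two days of each week.
weekly_pattern = [100, 100, 110, 105, 120, 180, 160]
gmv = weekly_pattern * 4  # four weeks of data

print(f"lag 7: {autocorrelation(gmv, 7):.2f}")
print(f"lag 3: {autocorrelation(gmv, 3):.2f}")
```

A value close to 1 at lag 7 alongside a weaker value at other lags is the kind of signal that suggests weekly seasonality.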

Shop Cohort      2019-01   2019-02   2019-03   2019-04   2019-05   2019-06
Plan
Adv. Shopify          71       102        27        71        87        73
Basic Shopify         45        44        42        59         0        57
Plus                  29        55        69        44        71        86
Shopify               53        72       108        72        60        28
Contingency table for categorical features “Shop Cohort” and “Plan”

The methodology outlined above only applies to continuous and discrete features. There are a few ways to compute correlation between categorical features; however, I’ll only discuss one, the Pearson chi-square test. This involves taking pairs of categorical features and computing their contingency table. Each cell in the contingency table shows the frequency of observations. In the Pearson chi-square test, the null hypothesis is that the categorical variables in question are independent and, therefore, not related. In the table above, we compute the contingency table for two categorical features from our dataset: Shop Cohort and Plan. After that, we perform a hypothesis test using the chi-square distribution with a specified alpha level of significance. We then determine whether the categorical features are independent or dependent.
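
The test statistic itself is straightforward to compute by hand. The sketch below, in plain Python, takes the contingency table above, derives the expected counts under independence, and compares the statistic with a critical value. The alpha of 0.05 is an assumption for illustration, the critical value for 15 degrees of freedom is hard-coded from a standard chi-square table, and the function name is hypothetical.

```python
# Contingency table from above: rows are plans, columns are cohorts 2019-01 to 2019-06.
observed = {
    "Adv. Shopify": [71, 102, 27, 71, 87, 73],
    "Basic Shopify": [45, 44, 42, 59, 0, 57],
    "Plus": [29, 55, 69, 44, 71, 86],
    "Shopify": [53, 72, 108, 72, 60, 28],
}


def chi_square_statistic(table):
    """Return the Pearson chi-square statistic and its degrees of freedom."""
    rows = list(table.values())
    row_totals = [sum(row) for row in rows]
    col_totals = [sum(col) for col in zip(*rows)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for r, row in enumerate(rows):
        for c, count in enumerate(row):
            # Expected count if the two features were independent.
            expected = row_totals[r] * col_totals[c] / grand_total
            statistic += (count - expected) ** 2 / expected
    dof = (len(rows) - 1) * (len(col_totals) - 1)
    return statistic, dof


statistic, dof = chi_square_statistic(observed)

# Critical value for dof = 15 at alpha = 0.05 (assumed), from a standard chi-square table.
CRITICAL_VALUE = 24.996

print(f"chi-square = {statistic:.2f}, dof = {dof}")
if statistic > CRITICAL_VALUE:
    print("Reject the null hypothesis of independence")
else:
    print("Fail to reject the null hypothesis of independence")
```

In practice a statistics library would also report a p-value, which avoids looking up critical values by hand.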

5. Spot Outliers in the Dataset

Last, but certainly not least, spotting outliers in your dataset is a crucial step in EDA. Outliers are significantly different from other samples in your dataset and can lead to major problems when performing statistical tasks following your EDA. There are many reasons why an outlier might occur. Perhaps there was a measurement error for that sample and feature, but in many cases outliers occur naturally.

Continuous feature box plots

The box plot visualization is extremely useful for identifying outliers. In the above figure, we observe that all features contain quite a few outliers because we see data points that are distant from the majority of the data.

There are many ways to identify outliers. It doesn’t make sense to plot all of our features one at a time to spot outliers, so how do we systematically accomplish this? One way is to compute the 1st and 99th percentile for each feature, then classify data points that are greater than the 99th percentile or less than the 1st percentile as outliers. For each feature, count the number of outliers, then rank the features from most outliers to least outliers. Focus on the features that have the most outliers and try to understand why that might be. Take note of your findings.
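
The percentile approach can be sketched as follows in plain Python. The helper names are hypothetical, missing values are assumed to be `None`, and the percentile uses linear interpolation between neighbouring values, which is one common convention among several.

```python
def percentile(sorted_values, q):
    """Linearly interpolated percentile (q in [0, 100]) of an already sorted list."""
    position = (len(sorted_values) - 1) * q / 100
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] + (sorted_values[upper] - sorted_values[lower]) * fraction


def count_outliers(values, low_q=1, high_q=99):
    """Count values below the low_q percentile or above the high_q percentile."""
    observed = sorted(v for v in values if v is not None)
    low, high = percentile(observed, low_q), percentile(observed, high_q)
    return sum(1 for v in observed if v < low or v > high)


def rank_by_outliers(columns):
    """Return (feature, outlier_count) pairs, most outliers first."""
    counts = {name: count_outliers(values) for name, values in columns.items()}
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)


# Hypothetical features: one spread out, one constant.
columns = {"spread_out": list(range(1, 101)), "constant": [5] * 100}
print(rank_by_outliers(columns))
```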

Unfortunately, the aforementioned approach doesn’t work for categorical features since there needs to be an ordering to compute percentiles. An outlier can mean many things. Suppose our categorical feature can assume one of three values: apple, orange, or pear. For 99 percent of samples, the value is either apple or orange, and for only 1 percent it is pear. This is one way we might classify an outlier for this feature. For more advanced methods on detecting anomalies in categorical data, check out Outlier Analysis.
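
The fruit example can be expressed as a frequency check. This is a minimal sketch in plain Python; the 1 percent threshold and the function name `rare_categories` are assumptions for illustration, not a standard rule.

```python
from collections import Counter


def rare_categories(values, threshold=0.01):
    """Return categories whose share of the samples is at or below `threshold`."""
    counts = Counter(values)
    total = len(values)
    return {
        value: count / total
        for value, count in counts.items()
        if count / total <= threshold
    }


# 100 hypothetical samples: pear appears only once.
fruit = ["apple"] * 60 + ["orange"] * 39 + ["pear"]
print(rare_categories(fruit))
```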

What’s Next?

We’re at the finish line and have completed our EDA. Let’s review the main takeaways:

  • Missing values can plague your data. Make sure to understand why they are there and how you plan to deal with them.
  • Provide a basic description of your features and categorize them. This will drastically change the visualizations you use and the statistical methods you apply.
  • Understand your data by visualizing its distribution. You never know what you will find! Get comfortable with how your data changes across samples and over time.
  • Your features have relationships! Make note of them. These relationships can come in handy down the road.
  • Outliers can dampen your fun only if you don’t know about them. Make known the unknowns!

But what do we do next? Well that all depends on the problem. It’s useful to present a summary of your findings to leadership and product. By performing an EDA, you might answer some of those crucial business questions we alluded to earlier. Going forward, does your team want to perform a regression or classification on the dataset? Do they want it to power a KPI dashboard? So many wonderful options and opportunities to explore!

It’s important to note that EDA is a very large area of focus. The steps I suggest are by no means exhaustive and if I had the time I could write a book on the subject! In this post, I share some of the most common approaches, but there’s so much more that can be added to your own EDA.

If you’re passionate about data at scale, and you’re eager to learn more, we’re hiring! Reach out to us or apply on our careers page.

Cody Mazza-Anthony is a Data Scientist on the Insights team. He is currently working on building a Product Analytics framework for Insights. Cody enjoys building intelligent systems to enable merchants to grow their business. If you’d like to connect with Cody, reach out on LinkedIn.

Dynamic ProxySQL Query Rules

ProxySQL comes with a powerful feature called query rules. The main use of these rules at Shopify is to reroute, rewrite, or reject queries matching a specified regex. However, with great power comes great responsibility. These rules are powerful and can have unexpected consequences if used incorrectly. At Shopify’s scale, we’re running thousands of ProxySQL instances, so applying query rules to each one is a painful and time consuming process, especially during an incident. We’ve built a tool to help us address these challenges and make deploying new rules safe and scalable.

How Shopify Dynamically Routes Storefront Traffic

In 2019 we set out to rewrite the Shopify storefront implementation. Our goal was to make it faster. We talked about the strategies we used to achieve that goal in a previous post about optimizing server-side rendering and implementing efficient caching. To build on that post, I’ll go into detail on how the Storefront Renderer team tweaked our load balancers to shift traffic between the legacy storefront and the new storefront.

First, let's take a look at the technologies we used. For our load balancer, we’re running nginx with OpenResty. We previously discussed how scriptable load balancers are our secret weapon for surviving spikes of high traffic. We built our storefront verification system with Lua modules and used that system to ensure our new storefront achieved parity with our legacy storefront. The system to permanently shift traffic to the new storefront, once parity was achieved, was also built with Lua. Our chatbot, spy, is our front end for interacting with the load balancers via our control plane.

At the beginning of the project, we predicted the need to constantly update which requests were supported by the new storefront as we continued to migrate features. We decided to build a rule system that allows us to add new routing rules easily.

Starting out, we kept the rules in a Lua file in our nginx repository, and kept the enabled/disabled status in our control plane. This allowed us to quickly disable a rule without having to wait for a deploy if something went wrong. It proved successful, and at this point in the project, enabling and disabling rules was a breeze. However, our workflow for changing the rules was cumbersome, and we wanted this process to be even faster. We decided to store the whole rule as a JSON payload in our control plane. We used spy to create, update, and delete rules in addition to the previous functionality of enabling and disabling the rules. We only needed to deploy nginx to add new functionality.

The Power of Dynamic Rules

Fast continuous integration (CI) time and deployments are great ways to increase the velocity of getting changes into production. However, for time-critical use cases like ours, removing the CI time and deployment altogether is even better. Moving the rule system into the control plane and using spy to manipulate the rules removed the entire CI and deployment process. We still require a “code review” on enabled spy commands or before enabling a new command, but that’s a trivial amount of time compared to the full deploy process used prior.

Before diving into the different options available for configuration, let’s look at what it’s like to create a rule with spy. Below are three images showing creating a rule, inspecting it, and then deleting it. The rule was never enabled, as it was an example, but that process requires getting approval from another member of the team. We are affecting a large share of real traffic on the Shopify platform when running the command spy storefront_renderer enable example-rule, so the rules of good code review still apply.

An example of how to create a routing rule with spy via Slack.
Adding a rule with spy
An example of how to describe an existing rule with spy via Slack.
Displaying a rule with spy
An example of how to remove an existing rule with spy via Slack.
Removing a rule with spy

Configuring New Rules

Now let’s review the different options available when creating new rules.

Option Name | Description | Default | Example
rule_name | The identifier for the rule. | - | products-json
filters | A comma-separated list of filters. | - | is_product_list_json_read
shop_ids | A comma-separated list of shop ids to which the rule applies. | all | -

The rule_name is the identifier we use. It can be any string, but it’s usually descriptive of the type of request it matches.

The shop_ids option lets us choose to have a rule target all shops or target a specific shop for testing. For example, test shops allow us to test changes without affecting real production data. This is useful to test rendering live requests with the new storefront because verification requests happen in the background and don’t directly affect client requests.

Next, the filters option determines which requests would match that rule. This allows us to partition the traffic into smaller subsets and target individual controller actions from our legacy Ruby on Rails implementation. A change to the filters list does require us to go through the full deployment process. They are kept in a Lua file, and the filters option is a comma-separated list of function names to apply to the request in a functional style. If all filters return true, then the rule will match that request.

An example of a filter is is_product_list_path, which lets us target HTTP GET requests to the storefront products JSON API implemented in Lua.
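The "all filters must return true" behaviour can be sketched as follows. The real filters are Lua functions in the nginx repository; this is only an illustrative Python sketch, and the Request shape, the is_get filter, the filter bodies, and the /products.json path are assumptions rather than Shopify's implementation.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class Request:
    """Hypothetical, minimal view of an incoming request."""
    method: str
    path: str


# Hypothetical filters; the real ones are Lua functions.
def is_get(request: Request) -> bool:
    return request.method == "GET"


def is_product_list_path(request: Request) -> bool:
    # Assumed path for the products JSON API.
    return request.path == "/products.json"


# Registry mapping the names used in the `filters` option to functions.
FILTERS: Dict[str, Callable[[Request], bool]] = {
    "is_get": is_get,
    "is_product_list_path": is_product_list_path,
}


def rule_matches(filters_option: str, request: Request) -> bool:
    """Return True only if every filter named in the comma-separated option passes."""
    names = [name.strip() for name in filters_option.split(",") if name.strip()]
    return all(FILTERS[name](request) for name in names)
```

In this sketch an unknown filter name raises a KeyError, which mirrors the idea that adding a new filter requires a code change and a deploy, while combining existing filters into a rule does not.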

Option Name | Description | Default | Example
render_rate | The rate at which we render allowed requests. | 0 | 1
verify_rate | The rate at which we verify requests. | 0 | 0
reverse_verify_rate | The rate at which requests are reverse-verified when rendering from the new storefront. | 0 | 0.001

Both render_rate and verify_rate allow us to target a percentage of traffic that matches a rule. This is useful for doing gradual rollouts of rendering a new endpoint or verifying a small sample of production traffic.
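A rate between 0 and 1 is commonly applied by comparing it against a uniform random draw per request. The sketch below assumes that approach; the post doesn't say how the load balancer samples, and should_sample is a hypothetical name.

```python
import random
from typing import Optional


def should_sample(rate: float, rng: Optional[random.Random] = None) -> bool:
    """Return True for roughly `rate` of calls; 0 never samples, 1 always does."""
    source = rng if rng is not None else random
    # random() returns a float in [0.0, 1.0), so rate=0 is never hit
    # and rate=1 is always hit.
    return source.random() < rate
```

With a render_rate of 0.001, about one matching request in a thousand would be selected, which is the kind of gradual rollout described above.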

The reverse_verify_rate is the rate used when a request is already being rendered with the new storefront. It lets us first render the request with the new storefront and then send the request to the legacy implementation asynchronously for verification. We call this scenario a reverse-verification, as it’s the reverse of the original flow, where the request was rendered by the legacy storefront and then sent to the new storefront for verification. We call that original flow forward-verification. We use forward-verification to find issues as we implement new endpoints and reverse-verification to help detect and track down regressions.

Option Name | Description | Default | Example
self_verify_rate | The rate at which we verify requests in the nearest region. | 0 | 0.001

Now is a good time to introduce self-verification and the associated self_verify_rate. One limitation of the legacy storefront implementation stemmed from our architecture for a Shopify pod, in which only one region had access to the MySQL writer at any given time. This meant that all requests had to go to the active region of a pod. With the new storefront, we decoupled the storefront rendering service from the database writer and now serve storefront requests from any region where a MySQL replica is present.

However, as we started decoupling dependencies on the active region, we found ourselves wanting to verify requests not only against the legacy storefront but also against the active and passive regions with the new storefront. This led us to add the self_verify_rate that allows us to sample requests bound for the active region to be verified against the storefront deployment in the local region.

We have found the routing rules flexible, and they made it easy to add new features or prototypes that are usually quite difficult to roll out. You might be familiar with how we generate load for testing the system's limits. However, these load tests will often fall victim to our load shedder if the system gets overwhelmed. In this case, we drop any request coming from our load generator to avoid negatively affecting a real client experience. Before BFCM 2020, we wondered how the application behaved if the connections to our dependencies, primarily Redis, went down. We wanted to be as resilient as possible to those types of failures. This isn’t quite the same as testing with a load generation tool because these tests could affect real traffic. The team decided to stand up a whole new storefront deployment, and instead of routing any traffic to it, we used the verifier mechanism to send duplicate requests to it. We then disconnected the new deployment from Redis and turned our load generator on max. Now we had data about how the application performed under partial outages and were able to dig into and improve the resiliency of the system before BFCM. These are just some of the ways we leveraged our flexible routing system to quickly and transparently change the underlying storefront implementation.

Implementation

I’d like to walk you through the main entry point for the storefront Lua module to show more of the technical implementation. First, here is a diagram showing where each nginx directive is executed during the request processing.

A flow chart showing the order different Lua callbacks are run in the nginx request lifecycle.
Order in which nginx directives are run - source: github.com/openresty/lua-nginx-module

During the rewrite phase, before the request is proxied upstream to the rendering service, we check the routing rules to determine which storefront implementation the request should be routed to. Then, during the header filter phase, we check whether the request should be forward-verified (if the request went to the legacy storefront) or reverse-verified (if it went to the new storefront). Finally, in the log phase, if we’re verifying the request (regardless of forward or reverse), we queue a copy of the original request to be made to the opposite upstream after the request cycle has completed.

The renderer module, referenced in the rewrite phase and the header filter phase, and the verifier module, referenced in the header filter phase and the log phase, use the same function, find_matching_rule, from the storefront rules module to get the matching rule from the control plane. The routing_method parameter is passed in to determine whether we’re looking for a rule to match for rendering or for verifying the current request.
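To make the role of routing_method concrete, here is a minimal Python sketch of a rule lookup. It is not the Lua find_matching_rule; the Rule fields beyond those in the option tables, the "render" and "verify" values, the request dictionary, and the first-match-wins behaviour are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Set


@dataclass
class Rule:
    name: str
    enabled: bool
    matches: Callable[[dict], bool]      # stands in for the rule's filter chain
    shop_ids: Optional[Set[int]] = None  # None stands in for "all" shops
    render_rate: float = 0.0
    verify_rate: float = 0.0


def find_matching_rule(rules: List[Rule], request: dict, routing_method: str) -> Optional[Rule]:
    """Return the first enabled rule that applies to the request and has a
    non-zero rate for the given routing method (assumed: "render" or "verify")."""
    rate_field = {"render": "render_rate", "verify": "verify_rate"}[routing_method]
    for rule in rules:
        if not rule.enabled:
            continue
        if rule.shop_ids is not None and request["shop_id"] not in rule.shop_ids:
            continue
        if getattr(rule, rate_field) > 0 and rule.matches(request):
            return rule
    return None
```

Sharing one lookup between the renderer and the verifier keeps the matching logic in a single place; only the rate consulted differs between the two callers.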

Lastly, the verifier module uses nginx timers to send the verification request out of band of the original request so we don’t leave the client waiting for both upstreams. The send_verification_request_in_background function is responsible for queuing the verification request to be sent. To duplicate the request and verify it, we need to keep track of the original request arguments and the response state from either the legacy storefront (in the case of a forward verification request) or the new storefront (in the case of a reverse verification request). The function then passes them as arguments to the timer, since we won’t have access to this information in the context of the timer.

The Future of Routing Rules

At this point, we're starting to simplify this system because the new storefront implementation is serving almost all storefront traffic. We’ll no longer need to verify or render traffic with the legacy storefront implementation once the migration is complete, so we'll be undoing the work we’ve done and going back to the hardcoded rules approach of the early days of the project. That doesn’t mean the routing rules are going away completely, though. The flexibility provided by the routing rules allowed us to build the verification system and stand up a separate deployment for load testing. These features weren’t possible before with the legacy storefront implementation. While we won’t be changing the routing between storefront implementations, the rule system will evolve to support new features.


We're planning to DOUBLE our engineering team in 2021 by hiring 2,021 new technical roles (see what we did there?). Our platform handled record-breaking sales over BFCM and commerce isn't slowing down. Help us scale & make commerce better for everyone.

Derek Stride is a Senior Developer on the Storefront Renderer team. He joined Shopify in 2016 as an intern and transitioned to a full-time role in 2018. He has worked on multiple projects and has been involved with the Storefront Renderer rewrite since the project began in 2019.


Building Smarter Search Products: 3 Steps for Evaluating Search Algorithms


By Jodi Sloan and Selina Li

Over 2 million users visit Shopify’s Help Center every month to find help solving a problem. They come to the Help Center with different motives: learn how to set up their store, find tips on how to market, or get advice on troubleshooting a technical issue. Our search product helps users narrow down what they’re looking for by surfacing what’s most relevant for them. Algorithms empower search products to surface the most suitable results, but how do you know if they’re succeeding at this mission?

Below, we’ll share the three-step framework we built for evaluating new search algorithms. From collecting data using Kafka and annotation, to conducting offline and online A/B tests, we’ll share how we measure the effectiveness of a search algorithm.

The Challenge

Search is a difficult product to build. When you input a search query, the search product sifts through its entire collection of documents to find suitable matches. This is no easy feat, as a search product’s collection and matching result set might be extensive. For example, thousands of articles live within the Shopify Help Center’s search product, and a search for “shipping” could return hundreds.

We use an algorithm we call Vanilla Pagerank to power our search. It boosts results by the total number of views an article has across all search queries. The problem is that it also surfaces non-relevant results. For example, if you conducted a search for “delete discounts” the algorithm may prefer results with the keywords “delete” or “discounts”, but not necessarily results on “deleting discounts”. 

We’re always trying to improve our users’ experience by making our search algorithms smarter. That’s why we built a new algorithm, Query-specific Pagerank, which aims to boost results with high click frequencies (a popularity metric) from historic searches containing the search term. It basically boosts the most frequently clicked-on results from similar searches.

The challenge is that any change to the search algorithms might improve the results for some queries but worsen them for others. So to have confidence in our new algorithm, we use data to evaluate its impact and performance against the existing algorithm. We implemented a simple three-step framework for evaluating experimental search algorithms.

1. Collect Data

Before we evaluate our search algorithms, we need a “ground truth” dataset telling us which articles are relevant for various intents. For Shopify’s Help Center search, we use two sources: Kafka events and annotated datasets.

Events-based Data with Kafka

Users interact with search products by entering queries, clicking on results, or leaving feedback. By using a messaging system like Kafka, we collect all live user interactions in schematized streams and model these events in an ETL pipeline. This culminates in a search fact table that aggregates facts about a search session based on behaviours we’re interested in analyzing.

A simplified model of a search fact generated by piecing together raw Kafka events. The model shows that 3 Kafka events make up a Search fact: 1. search query, 2. search result click, and 3. contacted support.
A simplified model of a search fact generated by piecing together raw Kafka events

With data generated from Kafka events, we can continuously evaluate our online metrics and see near real-time change. This helps us to:

  1. Monitor product changes that may be having an adverse impact on the user experience.
  2. Assign users to A/B tests in real time. We can set some users’ experiences to be powered by search algorithm A and others by algorithm B.
  3. Ingest interactions in a streaming pipeline for real-time feedback to the search product.

Annotation

Annotation is a powerful tool for generating labeled datasets. To ensure high-quality datasets within the Shopify Help Center, we leverage the expertise of the Shopify Support team to annotate the relevance of our search results to the input query. Manual annotation can provide high-quality and trustworthy labels, but can be an expensive and slow process, and may not be feasible for all search problems. Alternatively, click models can build judgments using data from user interactions. However, for the Shopify Help Center, human judgments are preferred since we value the expert ratings of our experienced Support team.

The process for annotation is simple:

  1. A query is paired with a document or a result set that might represent a relevant match.
  2. An annotator judges the relevance of the document to the query and assigns the pair a rating.
  3. The labels are combined with the inputs to produce a labelled dataset.
A diagram showing the 3 steps in the annotation process
Annotation Process

There are numerous ways we annotate search results:

  • Binary ratings: Given a user’s query and a document, an annotator can answer the question Is this document relevant to the query? by providing a binary answer (1 = yes, 0 = no).
  • Scale ratings: Given a user’s query and a document, a rater can answer the question How relevant is this document to the query? on a scale (1 to 5, where 5 is the most relevant). This provides interval data that you can turn into categories, where a rating of 4 or 5 represents a hit and anything lower represents a miss.
  • Ranked lists: Given a user’s query and a set of documents, a rater can answer the question Rank the documents from most relevant to least relevant to the query. This provides ranked data, where each query-document pair has an associated rank.

The design of your annotation options depends on the evaluation measures you want to employ. We used Scale Ratings with a worded list (bad, ok, good, and great) that provides explicit descriptions on each label to simplify human judgment. These labels are then quantified and used for calculating performance scores. For example, when conducting a search for “shipping”, our Query-specific Pagerank algorithm may return documents that are more relevant to the query where the majority of them are labelled “great” or “good”.

One thing to keep in mind with annotation is dataset staleness. Our document set in the Shopify Help Center changes rapidly, so datasets can become stale over time. If your search product is similar, we recommend re-running annotation projects on a regular basis or augmenting existing datasets that are used for search algorithm training with unseen data points.

2. Evaluate Offline Metrics

After collecting our data, we wanted to know whether our new search algorithm, Query-specific Pagerank, had a positive impact on our ranking effectiveness without impacting our users. We did this by running an offline evaluation that uses relevance ratings from curated datasets to judge the effectiveness of search results before launching potentially-risky changes to production. By using offline metrics, we run thousands of historical search queries against our algorithm variants and use the labels to evaluate statistical measures that tell us how well our algorithm performs. This enables us to iterate quickly since we simulate thousands of searches per minute.

There are a number of measures we can use, but we’ve found Mean Average Precision and Normalized Discounted Cumulative Gain to be the most useful and approachable for our stakeholders.

Mean Average Precision

Mean Average Precision (MAP) measures the relevance of each item in the results list to the user’s query with a specific cutoff N. As queries can return thousands of results, and few users will read all of them, only the top N returned results need to be examined. N is usually chosen arbitrarily or based on the number of paginated results. Precision@N is the percentage of relevant items among the first N recommendations. Average Precision (AP) for a single query averages Precision@k over the positions k (up to N) at which a relevant item appears, and MAP is calculated by averaging the AP scores across the queries in our dataset. The result is a measure that penalizes returning irrelevant documents before relevant ones. Here is an example of how MAP is calculated:

An example of how MAP is calculated for an algorithm, given 2 search queries as inputs
Example of MAP Calculation
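The calculation can also be sketched in a few lines of Python. This is a minimal sketch, not our production code: it assumes binary relevance labels in ranked order, and it normalizes AP by the number of relevant results found in the top N (some definitions divide by the total number of relevant documents instead). The function names and sample queries are illustrative.

```python
from typing import Dict, List


def average_precision(relevance: List[int], n: int) -> float:
    """AP@n for one query; `relevance` holds 1 (hit) or 0 (miss) in ranked order."""
    hits = 0
    precision_sum = 0.0
    for position, is_relevant in enumerate(relevance[:n], start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / position  # Precision@position
    return precision_sum / hits if hits else 0.0


def mean_average_precision(results: Dict[str, List[int]], n: int) -> float:
    """Average the AP@n scores over every query in `results`."""
    return sum(average_precision(r, n) for r in results.values()) / len(results)


# Two made-up queries with labels for their top three results.
labelled = {
    "shipping": [1, 0, 1],          # AP = (1/1 + 2/3) / 2
    "delete discounts": [0, 1, 0],  # AP = (1/2) / 1
}
print(round(mean_average_precision(labelled, n=3), 4))
```

Moving a relevant result up the list raises the precision at its position, which is how the measure rewards placing relevant documents before irrelevant ones.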

Normalized Discounted Cumulative Gain

To compute MAP, you need to classify whether a document is a good recommendation by determining a cutoff score. For example, document and search query pairs that have ok, good, and great labels (that is, scores greater than 0) will be categorized as relevant. But the differences in the relevancy of ok and great pairs will be neglected.

Discounted Cumulative Gain (DCG) addresses this drawback by maintaining the non-binary score while adding a logarithmic reduction factor (the denominator of the DCG function) to penalize recommended items that appear lower in the list. DCG is a measure of ranking quality that takes into account the position of documents in the results set.

An example of calculating and comparing DCG of two search algorithms. The scale ratings of each query and document pair are determined by annotators
Example of DCG Calculation

One issue with DCG is that the length of the search results differs depending on the query provided. This is problematic because the more results a query set has, the better the DCG score, even if the ranking doesn’t make sense. Normalized Discounted Cumulative Gain (NDCG) solves this problem. NDCG is calculated by dividing DCG by the maximum possible DCG score, that is, the score calculated from the same search results sorted from most to least relevant.
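As a rough illustration, DCG and NDCG can be computed as below. The sketch assumes the linear-gain form of DCG (rating divided by log2 of position + 1) and a hypothetical numeric mapping of the labels (bad = 0, ok = 1, good = 2, great = 3); other formulations use an exponential gain instead.

```python
import math
from typing import List


def dcg(ratings: List[float]) -> float:
    """Sum each rating discounted by log2(position + 1), positions starting at 1."""
    return sum(r / math.log2(pos + 1) for pos, r in enumerate(ratings, start=1))


def ndcg(ratings: List[float]) -> float:
    """Divide DCG by the DCG of the same ratings sorted from best to worst."""
    ideal = dcg(sorted(ratings, reverse=True))
    return dcg(ratings) / ideal if ideal > 0 else 0.0


# Assumed label mapping: bad=0, ok=1, good=2, great=3.
ranking_a = [3, 2, 1, 0]  # already in the ideal order
ranking_b = [3, 2, 0, 1]  # a "bad" result placed above an "ok" one
print(ndcg(ranking_a), ndcg(ranking_b))
```

Because the score is divided by the best achievable DCG for the same set of results, NDCG always falls between 0 and 1, which makes queries with different result counts comparable.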

Comparing the numeric values of offline metrics is great when the differences are significant. The higher the value, the more successful the algorithm. However, this only tells us the ending of the story. When the results aren’t significantly different, the insights we get from the metrics comparison are limited. Therefore, understanding how we got to the ending is also important for future model improvements and developments. You gather these insights by breaking down the composition of queries to look at:

  • Frequency: How often do our algorithms return worse results than annotation data?
  • Velocity: How far off are our rankings in different algorithms?
  • Commonalities: Understanding the queries that consistently have positive impacts on the algorithm performance, and finding the commonality among these queries, can help you determine the limitations on an algorithm.

Offline Metrics in Action

We conducted a deep dive analysis on evaluating offline metrics using MAP and NDCG to assess the success of our new Query-specific Pagerank algorithm. We found that our new algorithm returned higher scored documents more frequently and had slightly better scores in both offline metrics.

3. Evaluate Online Metrics

Next, we wanted to see how our users interact with our algorithms by looking at online metrics. Online metrics use search logs to determine how real search events perform. They’re based on understanding users’ behaviour with the search product and are commonly used to evaluate A/B tests.

The metrics you choose to evaluate the success of your algorithms depend on the goals of your product. Since the Shopify Help Center aims to provide relevant information to users looking for assistance with our platform, metrics that determine success include:

  • How often users interact with search results
  • How far they had to scroll to find the best result for their needs
  • Whether they had to contact our Support team to solve their problem.

When running an A/B test, we need to define these measures and determine how we expect the new algorithm to move the needle. Below are some common online metrics, as well as product success metrics, that we used to evaluate how search events performed:

  • Click-through rate (CTR): The portion of users who clicked on a search result when surfaced. Relevant results should be clicked, so we want a high CTR.
  • Average rank: The average rank of clicked results. Since we aim for the most relevant results to be surfaced first, and clicked, we want to have a low average rank.
  • Abandonment: When a user has intent to find help but didn’t interact with search results, didn’t contact us, and wasn’t seen again. We always expect some level of abandonment due to bots and spam messages (yes, search products get spam too!), so we want this to be moderately low.
  • Deflection: Our search is a success when users don’t have to contact support to find what they’re looking for. In other words, the user was deflected from contacting us. We want this metric to be high, but in reality we want people to contact us when it’s what is best for their situation, so deflection is a bit of a nuanced metric.
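Three of these metrics can be sketched from a simplified search fact. The SearchFact fields below are hypothetical, and the sketch computes the rates per search session rather than per user; abandonment is left out because it also needs to know whether the user was seen again.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SearchFact:
    """Hypothetical, simplified search fact for one search session."""
    clicked_rank: Optional[int]  # 1-based rank of the clicked result, None if no click
    contacted_support: bool


def click_through_rate(facts: List[SearchFact]) -> float:
    """Share of sessions in which a search result was clicked."""
    return sum(f.clicked_rank is not None for f in facts) / len(facts)


def average_rank(facts: List[SearchFact]) -> Optional[float]:
    """Mean rank of clicked results; None when nothing was clicked."""
    ranks = [f.clicked_rank for f in facts if f.clicked_rank is not None]
    return sum(ranks) / len(ranks) if ranks else None


def deflection_rate(facts: List[SearchFact]) -> float:
    """Share of sessions that did not end with the user contacting support."""
    return sum(not f.contacted_support for f in facts) / len(facts)


facts = [
    SearchFact(clicked_rank=1, contacted_support=False),
    SearchFact(clicked_rank=3, contacted_support=True),
    SearchFact(clicked_rank=None, contacted_support=False),
    SearchFact(clicked_rank=None, contacted_support=True),
]
print(click_through_rate(facts), average_rank(facts), deflection_rate(facts))
```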

We use the Kafka data collected in our first step to calculate these metrics and see how successful our search product is over time, across user segments, or between different search topics. For example, we study CTR and deflection rates between users in different languages. We also use A/B tests to assign users to different versions of our product to see if our new version significantly outperforms the old.

Flow of Evaluation Online Metrics

A/B testing search is similar to other A/B tests you may be familiar with. When a user visits the Help Center, they are assigned to one of our experiment groups, which determines which algorithm their subsequent searches will be powered by. Over many visits, we can evaluate our metrics to determine which algorithm outperforms the other (for example, whether algorithm A has a significantly higher click-through rate with search results than algorithm B).
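One common way to check whether a difference in click-through rate is statistically significant is a two-proportion z-test. The sketch below assumes that test and uses made-up counts; it isn't necessarily the test we ran, and two_proportion_z_test is an illustrative name.

```python
import math
from typing import Tuple


def two_proportion_z_test(clicks_a: int, n_a: int, clicks_b: int, n_b: int) -> Tuple[float, float]:
    """Return (z, two-sided p-value) for the hypothesis that both groups share one CTR."""
    p_a, p_b = clicks_a / n_a, clicks_b / n_b
    pooled = (clicks_a + clicks_b) / (n_a + n_b)
    # Standard error under the pooled rate; this is zero (and the division
    # below fails) if nobody or everybody clicked.
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail of the normal distribution
    return z, p_value


# Made-up counts: 400 of 1,000 searches clicked in group A, 460 of 1,000 in group B.
z, p = two_proportion_z_test(400, 1000, 460, 1000)
print(f"z={z:.2f}, p={p:.4f}")
```

A small p-value (for example, below 0.05) suggests the gap between the two groups is unlikely to be due to chance alone, though the threshold should be chosen before the experiment starts.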

Online Metrics in Action

We conducted an online A/B test to see how our new Query-specific Pagerank algorithm measured against our existing algorithm, with half of our search users assigned to group A (powered by Vanilla Pagerank) and half assigned to group B (powered by Query-specific Pagerank). Our results showed that users in group B were:

  • Less likely to click past the first page of results
  • Less likely to conduct a follow-up search
  • More likely to click results
  • More likely to click the first result shown
  • More likely to click higher-ranked results (a lower average rank of clicked results)

Essentially, group B users were able to find helpful articles with less effort when compared to group A users.

The Results

After using our evaluation framework to measure our new algorithm against our existing algorithm, it was clear that our new algorithm outperformed the former in a way that was meaningful for our product. Our metrics showed our experiment was a success, and we were able to replace our Vanilla Pagerank algorithm with our new and improved Query-specific Pagerank algorithm.

If you’re using this framework to evaluate your search algorithms, it’s important to note that even a failed experiment can help you learn and identify new opportunities. Did your offline or online testing show a decrease in performance? Is a certain subset of queries driving the performance down? Are some users better served by the changes than other segments? Wherever your analysis points, don’t be afraid to dive deeper to understand what’s happening. You’ll want to document your findings for future iterations.

Key Takeaways for Evaluating a Search Algorithm

Algorithms are the secret sauce of any search product. The efficacy of a search algorithm has a huge impact on a user’s experience, so having the right process to evaluate a search algorithm’s performance ensures you’re making confident decisions with your users in mind.

To quickly summarize, the most important takeaways are:

  • A high-quality and reliable labelled dataset is key for a successful, unbiased evaluation of a search algorithm.
  • Online metrics provide valuable insights on user behaviours, in addition to algorithm evaluation, even if they’re resource-intensive and risky to implement.
  • Offline metrics are helpful for iterating on new algorithms quickly and mitigating the risks of launching new algorithms into production.

Jodi Sloan is a Senior Engineer on the Discovery Experiences team. Over the past 2 years, she has used her passion for data science and engineering to craft performant, intelligent search experiences across Shopify. If you’d like to connect with Jodi, reach out on LinkedIn.

Selina Li is a Data Scientist on the Messaging team. She is currently working to build intelligence in conversation and improve merchant experiences in chat. Previously, she was with the Self Help team, where she helped deliver better search experiences for users in the Shopify Help Center and Concierge. If you would like to connect with Selina, reach out on LinkedIn.


If you’re a data scientist or engineer who’s passionate about building great search products, we’re hiring! Reach out to us or apply on our careers page.


How to Build a Web App with and without Rails Libraries


How would we build a web application only using the standard Ruby libraries? In this article, I’ll break down the key foundational concepts of how a web application works while building one from the ground up. If we can build a web application only using Ruby libraries, why would we need web server interfaces like Rack and web applications like Ruby on Rails? By the end of this article, you’ll gain a new appreciation for Rails and its magic.


Remove Circular Dependencies by Using the Repository Pattern in Ruby


There are dependencies between gems and the platforms that use them. In scenarios where the platforms have the data and the gem has the knowledge, there is a direct circular dependency between the two and both need to talk to each other. I’ll show you how we used the Repository pattern in Ruby to remove that circular dependency and help us make gems thin and stateless. Plus, I’ll show you how using Sorbet in the implementation made our code typed and cleaner.


Updates on Shopify’s Bug Bounty Program


For three years we, Shopify’s Application Security team, have set aside time to reflect on our bug bounty program and share recent insights. This past year has been quite a ride, as our program has been busier than ever! We’re excited to share what we have learned, and share some of the great things we have planned!


4 Tips for Shipping Data Products Fast


Shipping a new product is hard. Doing so under tight time constraints is even harder. It’s no different for data-centric products. Whether it’s a forecast, a classification tool, or a dashboard, you may find yourself in a situation where you need to ship a new data product in a seemingly impossible timeline. 

Shopify’s Data Science team has certainly found itself in this situation more than a few times over the years. Our team focuses on creating data products that support our merchants’ entrepreneurial journeys, from their initial interaction with Shopify, to their first sale, and throughout their growth journey on the platform. Commerce is a fast changing industry, which means we have to build and ship fast to ensure we’re providing our merchants with the best tools to help them succeed.

Along the way, our team learned a few key lessons for shipping data products quickly, while maintaining focus and getting things done efficiently—but also done right. Below are four tips that are proven to help you ship new products fast. 

1. Utilize Design Sprints 

Investing time in a design sprint pays off down the line as you approach deadlines. The design sprint (created by Google Ventures) is “a process for answering critical business questions through design, prototyping and testing ideas with customers.” Sprints are great for getting a new data product off the ground quickly because they carve out specific time blocks and resources for you and your team to work on a problem. The Shopify Data Science teams make sprints a common practice, especially when we’re under a tight deadline. When setting up new sprints, here are the steps we like to take:

  1. Choose an impactful problem to tackle: We focus on solving problems for our merchants, but in order to do that, we first have to uncover what those problems are by asking questions. What is the problem we’re trying to solve? Why are we solving this problem? Asking questions empowers you to find a problem worth tackling, identify the right technical solution and ultimately drive impact.
  2. Assemble a small sprint team: Critical to the success of any sprint is assembling a small team (no more than 6 or 7) of highly motivated individuals. Why a small team? It’s easier to stay aligned in a smaller group due to better communication and transparency, which means it’s easier to move fast.
  3. Choose a sprint Champion: This individual should be responsible for driving the direction of the project and making decisions when needed (should we use solution A or B?). Assigning a Champion helps remove ambiguity and allows the rest of the team to focus their energy on solving the problem in front of them.
  4. Set your sprint dates: Timeboxing is one of the main reasons why sprints are so effective. By setting fixed dates, you're committing your team to focus on shipping on a precise timeline. Typically, a sprint lasts up to five days. However, the timeline can be shortened based on the size of the project (for example, three days is likely enough time for creating the first version of a dashboard that follows the impact of COVID-19 on the business’s acquisition funnel).

With your problem identified, your team set up, and your dates blocked off, it’s now time to sprint. Keep in mind while exploring solutions that solving a data-centric problem with a non-data focused approach can sometimes be simple and time efficient. For instance, you can ask users for their preferred location rather than inferring it with a complex heuristic.

2. Don’t Skip Out on Prototyping

Speed is critical! The first iterations of a brand new product often go through many changes. Prototypes allow for quick and cheap learning cycles. They also help prevent the sunk cost fallacy (when a past investment becomes a rationale for continuing). 

In the data world, a good rule of thumb is to leverage spreadsheets for building a prototype. Spreadsheets are a versatile tool that help accelerate build times, yet are often underutilized by data scientists and engineers. By design, spreadsheets allow the user to make sense of data in messy contexts, with just a few clicks. The built-in functions cover most basic use cases: 

  • cleaning data by hand rapidly
  • displaying graphs
  • computing basic ranking indices
  • formatting output data.

While creating a robust system is desirable, it often comes at the expense of longer building times. When releasing a brand new data product under a tight timeline, the focus should be on developing prototypes fast. 

A sample Google Sheet dashboard evaluating Inbound Leads.  The dashboard consists of 6 charts.  The 3 line charts on the left measure Lead Count, Qualification Rate %, and Time to Qualification in Minutes.  The 3 bar charts on the right  measure Leads by Channel, Leads by Country, and Leads by Last Source Touched.
An example of a dashboard prototype created within Google Sheets.

Despite a strong emphasis on speed, a prototype should still look and feel professional. For example, the first iteration of the Marketing attribution tool developed for Shopify’s Revenue team was a collection of SQL queries automated by a bash script. The output was then formatted in a spreadsheet. This allowed us to quickly make changes to the prototype and compare it to out-of-the-box tools. We avoided wasting effort on spinning up dashboards and production code, and we avoided any sentimental attachment to the work, which made it easier for the best solution to win.

3. Avoid Machine Learning (on First Iterations)

When building a new data product, it’s tempting to spend lots of time on a flashy machine learning algorithm. This is especially true if the product is supposed to be “smart”. Building a machine learning model for your first iteration can cost a lot of time. For example, when sprinting to build a lead scoring system for our Sales Representatives, our team spent 80% of the sprint gathering features and training a model. This left little time to integrate the product with the existing customer relationship management (CRM) infrastructure, polish it, and ask for feedback. A simple ranking using a proxy metric would be much faster to implement for the first iteration. The time gained would allow for more conversations with the users about the impact, use and engagement with the tool. 
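As a rough illustration of how little code a proxy-metric ranking needs, here is a minimal sketch. The `Lead` class, the `pricing_page_visits` metric, and the sample data are all hypothetical; the post does not say which proxy metric would have been used.

```python
from dataclasses import dataclass


@dataclass
class Lead:
    name: str
    # Hypothetical proxy metric: how often the lead viewed the pricing page.
    pricing_page_visits: int


def rank_leads(leads: list[Lead]) -> list[Lead]:
    """Return leads ordered from most to least promising by the proxy metric."""
    return sorted(leads, key=lambda lead: lead.pricing_page_visits, reverse=True)


leads = [
    Lead("Acme", 2),
    Lead("Globex", 9),
    Lead("Initech", 5),
]
ranked_names = [lead.name for lead in rank_leads(leads)]
print(ranked_names)
```

A ranking like this can be shown to users on day one, which leaves the rest of the sprint for integration and feedback.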

We took that lesson to heart in our next project when we built a sales forecasting tool. We started with a linear regression using only two input variables that allowed us to have a prototype ready in a couple of hours. Using a simple model allowed us to ship fast and quickly learn whether it solved our user’s problem. Knowing we were on the right track, we then built a more complex model using machine learning.
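To give a sense of scale, a two-variable linear regression fits in a few dozen lines of plain Python. The sketch below is not the model we shipped: the two input variables and the data are synthetic, and the function names are made up for illustration.

```python
def fit_two_feature_regression(rows, targets):
    """Least-squares fit of y = b0 + b1*x1 + b2*x2 via the normal equations."""
    design = [(1.0, float(x1), float(x2)) for x1, x2 in rows]
    size = 3
    # Augmented matrix [X^T X | X^T y].
    aug = [
        [sum(r[i] * r[j] for r in design) for j in range(size)]
        + [sum(r[i] * y for r, y in zip(design, targets))]
        for i in range(size)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    # Assumes the two features are not perfectly collinear.
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(size):
            if row != col:
                factor = aug[row][col] / aug[col][col]
                aug[row] = [a - factor * b for a, b in zip(aug[row], aug[col])]
    return [aug[i][size] / aug[i][i] for i in range(size)]


def predict(coefficients, x1, x2):
    b0, b1, b2 = coefficients
    return b0 + b1 * x1 + b2 * x2


# Synthetic history: (hypothetical input 1, hypothetical input 2) -> sales.
history = [(1, 1), (2, 1), (3, 2), (4, 3), (5, 5)]
sales = [15, 17, 22, 27, 35]

coefficients = fit_two_feature_regression(history, sales)
forecast = predict(coefficients, 6, 6)
```

In practice a library such as scikit-learn or a spreadsheet’s built-in regression would do the same job; the point is only that the first model can be this small.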

Focus on building models that solve problems and can be shipped quickly. Once you’ve proven that your product is effective and delivers impact, then you can focus your time and resources on building more complex models. 

4. Talk to Your Users

Shipping fast also means shipping the right product. In order to stay on track, gathering feedback from users is invaluable! It allows you to build the right solution for the problem you’re tackling. Take the time to talk to your users before, during, and after each build iteration. Shadowing them, or even doing the task yourself, offers a great return on investment.

Gathering feedback is an art. Entire books and research papers are dedicated to it. Here are the two tips we use at Shopify that increased the value of feedback we’ve received:

  • Ask specific questions. Asking, “Do you have any feedback?” doesn’t help the user direct their thoughts. Questions like, “How do you feel about the speed at which the dashboard loads?” or “Are you able to locate the insights you need on this dashboard to report on top of funnel performance?” are more targeted and will yield richer feedback.
  • Select a diverse group of users for feedback. Let’s suppose that you are building a dashboard that’s going to be used by three regional teams. It’s more effective to send a request for feedback to one person in each team rather than five people in a single team.
A sample Google Form that measures Prototype A's Scoring. The form consists of 2 questions. The first question is "Is the score easy to parse and interpret?" It is scored using a ranking from 1 to 5, with 1 = Very Hard and 5 = Very Easy. The second question is "Additional Comments" and has a text field for the answer.
Feedback our team asked for on the scoring system we created. When asking for feedback, ask specific questions so you get better feedback.

We implemented the two tips when requesting feedback from users of the sales forecasting tool highlighted in the previous section. We asked a diverse group specific questions about our product, and learned that displaying a numerical score (0 - 100) was confusing. The difference between scores wasn’t clear to the users. Instead, it was suggested to display grades (A, B, C) which turned out to be much quicker to interpret and led to a better user experience.

At Shopify, following these tips has provided the team with a clearer path for launching brand new data products under tight time constraints. More importantly, it helped us avoid common pitfalls like getting stuck during never-ending design phases, overengineering complex machine learning systems, or building data products that users don’t use.

Next time you’re under a tight timeline to ship a new data product, remember to:

  1. Utilize design sprints to help focus your team’s efforts and remove the stress of the ticking clock
  2. Don’t skip prototyping; it’s a great way to fail early
  3. Avoid machine learning (for first iterations) to avoid being slowed down by unnecessary complexity
  4. Talk to your users so you can get a better sense of what problem they’re facing and what they need in a product

If you’d like to read more about shipping new products fast, we recommend checking out The Design Sprint book, by Jake Knapp et al. which provides a complete framework for testing new ideas.


If you’re interested in helping us ship great data products, quickly, we’re looking for talented data scientists to join our team.

Read Consistency with Database Replicas

At Shopify, we’ve long used database replication for redundancy and failure recovery, but only fairly recently started to explore the potential of replicas as an alternative read-only data source for applications. Using read replicas holds the promise of enhanced performance for read-heavy applications, while alleviating pressure on primary servers that are needed for time-sensitive read/write operations.

There's one unavoidable factor that can cause problems with this approach: replication lag. In other words, applications reading from replicas might be reading stale data—maybe seconds or even minutes old. Depending on the specific needs of the application, this isn’t necessarily a problem. A query may return data from a lagging replica, but any application using replicated data has to accept that it will be somewhat out of date; it’s all a matter of degree. However, this assumes that the reads in question are atomic operations.

In contrast, consider a case where related pieces of data are assembled from the results of multiple queries. If these queries are routed to various read replicas with differing amounts of replication lag and the data changes in midstream, the results could be unpredictable.

 

Reading from replicas with varying replication lag produces unpredictable results

For example, a row returned by an initial query could be absent from a related table if the second query hits a more lagging replica. Obviously, this kind of situation could negatively impact the user experience and, if these mangled datasets are used to inform future writes, then we’re really in trouble. In this post, I’ll show you the solution the Database Connection Management team at Shopify chose to solve variable lag and how we solved the issues we ran into.

Tight Consistency

One way out of variable lag is by enforcing tight consistency, meaning that all replicas are guaranteed to be up to date with the primary server before any other operations are allowed. This solution is expensive and negates the performance benefits of using replicas. Although we can still lighten the load on the primary server, it’s at the cost of delayed reads from replicas.

Causal Consistency

Another approach we considered (and even began to implement) is causal consistency based on global transaction identifier (GTID). This means that each transaction in the primary server has a GTID associated with it, and this GTID is preserved as data is replicated. This allows requests to be made conditional upon the presence of a certain GTID in the replica, so we can ensure replicated data is at least up to date with a certain known state in the primary server (or a replica), based on a previous write (or read) that the application has performed. This isn’t the absolute consistency provided by tight consistency, but for practical purposes it can be equivalent.

The main disadvantage to this approach is the need to implement software running on each replica which would report its current GTID back to the proxy so that it can make the appropriate server selection based on the desired minimum GTID. Ultimately, we decided that our use cases didn’t require this level of guarantee, and that the added level of complexity was unnecessary.

Our Solution to Read Consistency

Other models of consistency in replicated data necessarily involve some kind of compromise. In our case, we settled on a form of monotonic read consistency: successive reads will follow a consistent timeline, though not necessarily reading the latest data in real time. The most direct way to ensure this is for any series of related reads to be routed to the same server, so successive reads will always represent the state of the primary server at the same time or later, compared to previous reads in that series.

Reading repeatedly from a single replica produces a coherent data timeline

In order to simplify implementation and avoid unnecessary overhead, we wanted to offer this functionality on an opt-in basis, while at the same time avoiding any need for applications to be aware of database topology and manage their own use of read replicas. To see how we did this, let’s first take a step back.

Application access to our MySQL database servers is through a proxy layer provided by ProxySQL using the concept of hostgroups: essentially pools of interchangeable servers which look like a single data source from the application’s point of view.

A modified illustration from the ProxySQL website shows its place in our architecture

When a client application connects with a user identity assigned to a given hostgroup, the proxy routes its individual requests to any randomly selected server within that hostgroup. (This is somewhat oversimplified in that ProxySQL incorporates considerations of latency, load balancing, and such into its selection algorithm, but for purposes of this discussion we can consider the selection process to be random). In order to provide read consistency, we modified this server selection algorithm in our fork of ProxySQL.

Any application which requires read consistency within a series of requests can supply a unique identifier common to those requests. This identifier is passed within query comments as a key-value pair:

/* consistent_read_id:<some unique ID> */ SELECT <fields> FROM <table>

The ID we use to identify requests is always a universally unique identifier (UUID) representing a job or any other series of related requests. This consistent_read_id forms the basis for a pseudorandom but repeatable index into the list of servers that replaces the default random indexing taking place in the absence of this identifier. Let’s see how.

First, a hashing algorithm is applied to the consistent_read_id to yield an integer value. We then take this number modulo the number of servers, and the result becomes our pseudorandom index into the list of available servers. Repeated application of this algorithm yields the same pseudorandom result, thus maintaining read consistency over a series of requests specifying the same consistent_read_id. This explanation is simplified in that it ignores the server weighting which is configurable in ProxySQL. Here’s what an example looks like, including the server weighting:

The consistent_read_id is used to generate a hash that yields an index into a weighted list of servers. In this example, every time we receive this particular consistent_read_id, server 1 will be selected.
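The selection logic described above can be sketched in a few lines of Python. This is an illustration, not the code in our ProxySQL fork (which is written in C++): the regular expression, the use of SHA-256, the function names, and the way weights are expanded into a list are all assumptions made for the example.

```python
import hashlib
import re

# Hypothetical pattern for the key-value pair embedded in a query comment.
CONSISTENT_READ_ID = re.compile(r"/\*\s*consistent_read_id:([\w-]+)\s*\*/")


def extract_consistent_read_id(query: str):
    """Return the consistent_read_id from a query comment, or None if absent."""
    match = CONSISTENT_READ_ID.search(query)
    return match.group(1) if match else None


def weighted_server_list(weights: dict) -> list:
    """Expand {server: weight} so each server appears `weight` times."""
    return [server for server, weight in weights.items() for _ in range(weight)]


def select_server(consistent_read_id: str, weights: dict) -> str:
    """Map an ID to a server with a stable hash, so repeated calls agree."""
    # SHA-256 is an assumption. Python's built-in hash() is salted per
    # process, so it would not be repeatable across restarts.
    digest = hashlib.sha256(consistent_read_id.encode("utf-8")).digest()
    slots = weighted_server_list(weights)
    index = int.from_bytes(digest[:8], "big") % len(slots)
    return slots[index]


query = (
    "/* consistent_read_id:123e4567-e89b-12d3-a456-426614174000 */ "
    "SELECT id FROM orders"
)
weights = {"replica-1": 2, "replica-2": 1}
read_id = extract_consistent_read_id(query)
first = select_server(read_id, weights)
second = select_server(read_id, weights)
```

Because the hash depends only on the identifier, `first` and `second` always name the same server as long as the weighted list doesn’t change.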

A Couple of Bumps in the Road

I’ve covered the basics of our consistent-read algorithm, but there were a couple of issues to be addressed before the team got it working perfectly.

The first one surfaced during code review and relates to situations where a server becomes unavailable between successive consistent read requests. If the unavailable server is the one that was previously selected (and therefore would’ve been selected again), a data inconsistency is possible—this is a built-in limitation of our approach. However, even if the unavailable server isn’t the one that would’ve been selected, applying the selection algorithm directly to the list of available servers (as ProxySQL does with random server selection) could also lead to inconsistency, but in this case unnecessarily. To address this issue, we index into the entire list of configured servers in the host group first, then disqualify the selected server and reselect if necessary. This way, the outcome is affected only if the selected server is down, rather than having the indexing potentially affected for others in the list. Discussion of this issue in a different context can be found on ebrary.net.

Indexing into configured servers rather than available servers provides a better chance of consistency in case of server failures
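A minimal Python sketch of this idea follows. The reselection rule used here (hash again over the remaining available servers) is an assumption for the example, as are the function names; the post only states that the selected server is disqualified and another is chosen.

```python
import hashlib


def stable_index(consistent_read_id: str, size: int) -> int:
    """Repeatable index in range(size) derived from the identifier."""
    digest = hashlib.sha256(consistent_read_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % size


def select_with_fallback(consistent_read_id, configured, available):
    """Index into all configured servers; reselect only if the choice is down."""
    candidates = [server for server in configured if server in available]
    if not candidates:
        return None
    choice = configured[stable_index(consistent_read_id, len(configured))]
    if choice in available:
        return choice
    # Hypothetical reselection rule: index into the servers that are still up.
    return candidates[stable_index(consistent_read_id, len(candidates))]


configured = ["replica-1", "replica-2", "replica-3"]
read_id = "123e4567-e89b-12d3-a456-426614174000"

chosen = select_with_fallback(read_id, configured, set(configured))
others = [server for server in configured if server != chosen]

# An unrelated server going down does not change the outcome.
unaffected = select_with_fallback(read_id, configured, set(configured) - {others[0]})

# Only when the chosen server itself is down do we land somewhere else.
fallback = select_with_fallback(read_id, configured, set(others))
```

Had the index been computed over the available list instead, removing `others[0]` could have shifted every position after it and silently changed the selection.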

The second issue was discovered as an intermittent bug that led to inconsistent reads in a small percentage of cases. It turned out that ProxySQL was doing an additional round of load balancing after initial server selection. For example, in a case where the target server weighting was 1:1 and the actual distribution of server connections drifted to 3:1, any request would be forcibly rerouted to the underweighted server, overriding our hash-based server selection. By disabling the additional rebalancing in cases of consistent-read requests, these sneaky inconsistencies were eliminated.

Currently, we're evaluating strategies for incorporating flexible use of replication lag measurements as a tuneable factor that we can use to modify our approach to read consistency. Hopefully, this feature will continue to appeal to our application developers and improve database performance for everyone.

Our approach to consistent reads has the advantage of relative simplicity and low overhead. Its main drawback is that server outages (especially intermittent ones) will tend to introduce read inconsistencies that may be difficult to detect. If your application is tolerant of occasional failures in read consistency, this hash-based approach to implementing monotonic read consistency may be right for you. On the other hand, if your consistency requirements are more strict, GTID-based causal consistency may be worth exploring. For more information, see this blog post on the ProxySQL website.

Thomas has been a software developer, a professional actor, a personal trainer, and a software developer again. Currently, his work with the Database Connection Management team at Shopify keeps him learning and growing every day.


We're always on the lookout for talent and we’d love to hear from you. Visit our Engineering career page to find out about our open positions.

Bound to Round: 8 Tips for Dealing with Hanging Pennies

Rounding is used to simplify the use of numbers that contain more decimal places than required. The perfect example is representing cash, money, dough. In the USA and Canada, the cent represents the smallest fraction of money. The US and Canadian dollar can’t be transacted with more than 2 decimal places. When numbers represent money, we use rounding to replace an un-representable, un-transactable money amount with one that represents a cash tender.

The best way to introduce this blog is by asking you to watch a scene from one of my favorite movies, Office Space:

In this scene, Peter describes to his girlfriend a program that compounds interest using high precision amounts. He explains that they simplify the calculations by rounding the amounts down and by doing that they’re left with hanging pennies that they transfer into their personal accounts.

This is exactly what we want to avoid—we want to avoid having one developer aware of hanging pennies. We also want to avoid having many hanging pennies. And when faced with such a situation, we want to identify such calculations and put a plan in place on who to notify and what to do with them. 

Before I explain this further, I want to tell you this story first. My father introduced banking software systems in the Middle East in the late 70’s. Rest assured he was bound to round. He faced the same issue Peter faced. He resolved it by accepting that he can’t resolve it. So, he created an account where the extra pennies accumulated and later were given as bonuses to the IT team at the bank. It was a way of getting back at the rest of the employees at the bank that didn’t want to move to using a software system and preferred pen and paper.

The Rounding Dilemma

Okay, let’s get back to breaking this problem down further with another example.

Let’s assume we can only charge 1 total amount, even if this 1 amount consists of a summation of multiple rates.

Rate 1 is 2.4%
Rate 2 is 2.9%
Amount $10.10

When rounding individual rate amounts:
Rate 1 total = (2.4 / 100) * $10.10 = $0.2424, rounded to $0.24
Rate 2 total = (2.9 / 100) * $10.10 = $0.2929, rounded to $0.29
Total = $0.24 + $0.29 = $0.53

When rounding the total of the rate amounts:
Rate 1 total = (2.4 / 100) * $10.10 = $0.2424
Rate 2 total = (2.9 / 100) * $10.10 = $0.2929
Total = $0.2424 + $0.2929 = $0.5353, rounded to $0.54

The example above makes it clear that deciding when to round can either make you more money by collecting the loose penny or lose money by deciding to let go of it.
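The same arithmetic can be reproduced with Python’s decimal module. This is a sketch of the example above; the half-up rounding mode and the names `to_cents`, `rounded_each_total`, and `rounded_once_total` are choices made for illustration.

```python
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")


def to_cents(value: Decimal) -> Decimal:
    """Round a money amount to two decimal places (half-up, as an assumption)."""
    return value.quantize(CENT, rounding=ROUND_HALF_UP)


amount = Decimal("10.10")
rates = [Decimal("2.4"), Decimal("2.9")]

# Unrounded amount for each rate: 0.2424 and 0.2929.
fees = [amount * rate / Decimal("100") for rate in rates]

# Option 1: round each rate amount, then add them up.
rounded_each_total = sum(to_cents(fee) for fee in fees)

# Option 2: add the unrounded amounts, then round once.
rounded_once_total = to_cents(sum(fees))

print(rounded_each_total, rounded_once_total)
```

The one-cent gap between the two totals is the hanging penny.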

Rounding at different stages in the example above has more impact if there are currency conversions involved. As a rule of thumb, the more currency conversions (which also involve rounding) and more rounding, the more we lose precision along the way. 

Rational numbers are natural products of various banking calculations: distributed payments, shared liabilities, and rates applied. So, you’ll face other rounding encounters in many other places in financial software, most notably while calculating taxes or discounts and, just like in Office Space, while calculating interest. 

Did it make cents? I hope you have a grasp on the problem. Now, is this avoidable? No, it’s not. If you’re working on financial software you’ll eventually be bound to round. But, we can control where and how to handle the precision loss. I’m sharing 8 tips to make your precision obsessive compulsiveness a bit less troubling to you as a developer and to the company as a business.

1. Notify Stakeholders

Show and tell where the rounding happens within your calculations to the stakeholders of your project. Explain the impact of the rounding, document it, and keep talking about it until all leaders on your team and within your department are aware. You, as a developer, don’t have to take the full burden of knowing that the company is making less than 1 cent on some transactions because of the calculations you put in place. Is a problem really a problem if it’s everyone’s problem?!

2. Use Banker’s Rounding

There are many types of rounding. Some rounding methods increase bias and others decrease it. Banker’s rounding is the method proven to decrease rounding bias within calculations. Banker’s rounding deliberately distorts some of the rounded values to bring the totals of rounded numbers as close as possible to the totals of the original numbers. Talking about why regular rounding taught in schools can’t meet our needs and why Banker’s rounding is mostly used for financial calculations would turn this blog into a math lesson, and as much as I would love to do that, I’d probably lose many readers.
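Without turning this into a math lesson, a small Python sketch shows the effect. Banker’s rounding rounds a value that sits exactly halfway to the nearest even digit, which the decimal module calls ROUND_HALF_EVEN. The amounts below are made up to sit exactly on the halfway point.

```python
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP

CENT = Decimal("0.01")

# Made-up amounts that all sit exactly halfway between two cents.
amounts = [Decimal("0.005"), Decimal("0.015"), Decimal("0.025"), Decimal("0.035")]

exact_total = sum(amounts)

# School rounding: halves always go up, so the error accumulates.
half_up_total = sum(a.quantize(CENT, rounding=ROUND_HALF_UP) for a in amounts)

# Banker's rounding: halves go to the even neighbour, so errors cancel out.
half_even_total = sum(a.quantize(CENT, rounding=ROUND_HALF_EVEN) for a in amounts)

print(exact_total, half_up_total, half_even_total)
```

Here the half-up total drifts two cents above the exact total, while the banker’s rounding total matches it.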

3. Use Data Types That Hold the Most Precision

Within your calculations, ensure that all variables use data types that can hold as much precision as possible (enough decimal places). For example, use a double instead of a float. It’s important to keep the precision wherever rounding isn’t involved, as it reduces the number of hanging pennies.
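To see what is at stake, the sketch below pushes a Python float (which is a 64-bit double) through 32-bit storage and compares the result. It also shows a decimal type, which is one option this post doesn’t cover but which many languages offer for money. The helper name `as_float32` and the sample amount are made up for the example.

```python
import struct
from decimal import Decimal


def as_float32(value: float) -> float:
    """Round-trip a Python float (a 64-bit double) through 32-bit storage."""
    return struct.unpack("f", struct.pack("f", value))[0]


amount = 1234567.89

# A 32-bit float keeps about 7 significant digits, so the cents are damaged.
single_precision = as_float32(amount)
lost = abs(single_precision - amount)

# Doubles hold far more digits, but still cannot store 0.1 or 0.2 exactly.
double_sum_is_exact = (0.1 + 0.2 == 0.3)

# A decimal type stores base-10 digits exactly.
decimal_sum_is_exact = (Decimal("0.10") + Decimal("0.20") == Decimal("0.30"))

print(single_precision, lost, double_sum_is_exact, decimal_sum_is_exact)
```

With the 32-bit float, more than a cent goes missing before any deliberate rounding has happened.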

4. Be Consistent

I mean, this applies to a lot of things in life. When you and your team decide on which rounding methods to use, ensure that the same rounding method is used throughout your code. 

5. Be Explicit About Rounding

When rounding within your calculation, make it explicit by either adding comments or prefixing rounded variables with “rounded_”. This ensures that anyone reading your code understands where precision loss is happening. Link to documentation about rounding strategies within your code documentation.
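In code, that convention might look like the short Python sketch below. The variable names and values are hypothetical.

```python
from decimal import Decimal, ROUND_HALF_EVEN

order_total = Decimal("10.10")
fee_rate_percent = Decimal("2.9")

# Full precision: no rounding has happened yet.
fee = order_total * fee_rate_percent / Decimal("100")

# Precision loss happens here, and only here (banker's rounding to cents).
rounded_fee = fee.quantize(Decimal("0.01"), rounding=ROUND_HALF_EVEN)

print(fee, rounded_fee)
```

A reader scanning for the `rounded_` prefix can find every place where a hanging penny might appear.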

6. Refer to Government Rounding Standards

A photo of the 1040 U.S. Individual Income Tax Return form on a desk.
The 1040 U.S. Individual Income Tax Return form

Losing precision is a universal problem and not only suffered by mathematicians. Refer to your government’s ruling around rounding. When it comes to tax calculations, governments might have different rules. Refer to them and educate yourself and your team.

7. Round Only When You Absolutely Have To

Remember, only tender money amounts need to be rounded. Whenever you can avoid rounding, do so!

8. Tell Your Users

Please don’t hide the rounding methods you use from your users. Many users will try to reverse engineer calculations on their own, and as a company you don’t want to end up explaining this several times. Ensure rounding rules are explicitly written in your documentation and easily accessible.

A circular logo with a Shopify shopping bag above the words "Be Merchant Obsessed. What Shopify Values"
Be Merchant Obsessed

At Shopify, we are, of course, bound to round. If you are a Shopify merchant reading this post I want to assure you that in all our calculations, developers are biased towards benefiting our merchants. Not only are our support teams merchant obsessed, all Shopify developers are too.

Dana is a senior developer on the Money team at Shopify. She’s been in software engineering since 2007. Her primary interests are back-end development, database design, and software quality management. She's contributed to a variety of products, and since joining Shopify she's been on the Shopify Payments and Balance teams. She recently switched to data development to deliver impactful money insights to our merchants.


We're planning to DOUBLE our engineering team in 2021 by hiring 2,021 new technical roles (see what we did there?). Our platform handled record-breaking sales over BFCM and commerce isn't slowing down. Help us scale & make commerce better for everyone.

Using Betas to Deploy New Features Safely

For companies like Shopify that practice continuous deployment, our code is changing multiple times every day. We have to de-risk new features to ship safely and confidently without impacting the million+ merchants using our platform. Beta flags are one approach to feature development that gives us a number of notable advantages.

Technical Mentorship Reimagined: Time-bound and No Awkward Asks Necessary

Authors: Sarah Naqvi and Steve Lounsbury

Struggling with a concept and frantically trying to find the answers online? Are you thinking: I should just ping a teammate or Is this something I should already know? That’s fine, I’ll ask anyway. And then you do ask and get an answer and are feeling pretty darn proud of yourself and move forward. But wait, now I have about 25 follow-up questions. Sound familiar?

Or how about the career growth questions that can sometimes be too uncomfortable to ask your lead or manager? Oh, I know, I’ll take them to Reddit. But wait, they have no company context. Yeah, also familiar. I know. I should get a mentor! But darn. I’ve heard so many stories… Who do I approach? Will they be interested? What if they reject me? And what if it’s not working out?

Shopify is a collaborative place. We traditionally pair with other developers and conduct code reviews to level up. This approach is great for just-in-time feedback and unblocking us on immediate problems. We wanted to continue this trend and also look at how we can support developers in growing themselves through longer term conversations.

We surveyed our developers at Shopify and learned that they:

  • Love to learn from others at Shopify
  • Are busy people and find it challenging to make dedicated time for learning
  • Want to grow in their technical and leadership skills.

These findings birthed the idea of building a mentorship program targeted at solving these exact problems. Enter Shopify’s Engineering Mentorship Program. Shopify’s RnD Learning Team partnered with developers across the company to design a unique mentorship program and this guide walks readers through the structure, components, value-adds, and lessons we’ve had over the past year.

Gathering Participants

Once a quarter developers get an email inviting them to join the upcoming cycle of the program and sign up to be a mentee, mentor, or both. In addition to the email that is sent out, updates are posted in a few prominent Slack channels to remind folks that the signup window for the upcoming cycle is now open.

When signing up to participate in the program, mentees are asked to select their areas of interest and mentors are asked to select their areas of expertise. The current selections include:

  • Back-end
  • Data
  • Datastores
  • Front-end
  • Infrastructure
  • Mobile - Android
  • Mobile - iOS
  • Mobile - React Native
  • Non-technical (leadership, management).

Mentors are also prompted to choose if they are interested in supporting one or two mentees for the current cycle.

Matching Mentors and Mentees

Once the signup period wraps up, we run an automated matching script to pair mentees and mentors. Pairs are matched based on a few criteria:

  • Mentor isn’t mentee's current lead
  • Mentor and mentee don’t report to same lead
  • Aligned areas of interest and expertise based on selections in sign-up forms
  • Mentee job level is less than or equal to mentor job level.

The matching system intentionally avoids matching on criteria such as geographic location or department, so that developers broaden their network and gain company context they otherwise wouldn’t have been exposed to.
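The criteria above translate naturally into a filter. The Python sketch below is not our matching script: the `Participant` fields, the greedy first-fit strategy, and the sample people are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Participant:
    name: str
    lead: str
    job_level: int
    areas: frozenset  # areas of interest (mentee) or expertise (mentor)


def is_eligible_pair(mentor: Participant, mentee: Participant) -> bool:
    return (
        mentor.name != mentee.name
        and mentor.name != mentee.lead            # mentor isn't the mentee's lead
        and mentor.lead != mentee.lead            # they don't share a lead
        and bool(mentor.areas & mentee.areas)     # at least one aligned area
        and mentee.job_level <= mentor.job_level
    )


def match(mentors, mentees, capacity):
    """Greedy first-fit matching (an assumed strategy, chosen for brevity)."""
    remaining = dict(capacity)  # mentor name -> mentees they can take (1 or 2)
    pairs = []
    for mentee in mentees:
        for mentor in mentors:
            if remaining.get(mentor.name, 0) > 0 and is_eligible_pair(mentor, mentee):
                pairs.append((mentor.name, mentee.name))
                remaining[mentor.name] -= 1
                break
    return pairs


mentors = [
    Participant("Ana", "Lee", 5, frozenset({"Data", "Back-end"})),
    Participant("Ben", "Kim", 4, frozenset({"Front-end"})),
]
mentees = [
    Participant("Cy", "Ana", 3, frozenset({"Data"})),        # Ana is Cy's lead
    Participant("Dee", "Moe", 2, frozenset({"Data"})),
    Participant("Eli", "Moe", 5, frozenset({"Front-end"})),  # level above Ben's
    Participant("Flo", "Zed", 4, frozenset({"Front-end", "Data"})),
]
pairs = match(mentors, mentees, {"Ana": 1, "Ben": 1})
print(pairs)
```

Note that location and department never appear in `is_eligible_pair`, which mirrors the intent described above.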

Pairs are notified by email of their match and invited to a kickoff meeting where organizers welcome participants, explain the program model and value that they will receive as a mentor or mentee, and answer any questions.

Running the Six Week Program Cycle

Each program cycle runs for six weeks and pairs are expected to meet for approximately one hour per week.

Shopify’s Engineering Mentorship Program overview

The time bound nature of the program enables developers to try out the program and see if it’s a good fit for them. They can connect with someone new and still feel comfortable knowing that they can walk away, no strings attached.

The voluntary signup process ensures that developers who sign up to be a mentor are committed to supporting a mentee for the six week duration and mentees can rest assured that their mentor is keen on supporting them with their professional growth. The sign-up emails as well as the sign-up form reiterate the importance of only signing up as a mentor or mentee if you can dedicate at minimum an hour per week for the six week cycle.

Setting Goals

In advance of the first meeting, mentees are asked to identify technical skills gaps they want to improve. During their first meeting, mentees and mentors work together building a tangible goal that they can work towards over the course of the six weeks. Goals often change and that’s expected.

Through the initial kickoff meeting and weekly check-ins via Slack, we reinforce and reiterate throughout the program that the goal itself is never the goal, but an opportunity to work towards a moving target.

Defining the goal is often the most challenging part for mentees. Mentors take an active role in helping them craft it, and the program team also provides plenty of real examples from past mentees.

Staying Connected as a Group

Outside of the one on one weekly meetings between each pairing of mentor and mentee, the broader mentorship community stays connected on Slack. Two Slack channels are used to manage the program and connect participants with one another and with the program team.

The first Slack channel is a public space for all participants as well as anyone at the company who is curious about the program. This channel serves to advertise the program and keep participants connected to each other and to the program team. This happens through regular questions and through participants continuously sharing what’s working well, how they’ve pivoted (or not) from their initial goals, and any general tips to support fellow mentors and mentees throughout the program.

The second Slack channel is a private space that is used exclusively by the program team and mentors. This channel is a space for mentors to be vulnerable and lean on fellow mentors for suggestions and resources.

Preparing the Participants with a Guidebook

Beyond the Slack channels, the other primary resource our participants use is a mentorship guidebook that curates tips and resources for added structure. The program team felt that a guidebook was an important aspect to include for participants who were craving more support. While it is entirely an optional resource to use, many first time mentors and mentees find it to be a helpful tool in navigating an otherwise open ended relationship between themselves and their match. It includes tips on sample agendas for your first meeting, example goals, and ways to support your mentee. For example, invite them to one of your meetings and debrief afterwards, pair, or do a code review.

Growing Mentors’ Skills Too

Naturally, teaching someone else a technical concept helps reinforce it in our own minds. Our mentors constantly share how the program helps them refine their craft skills as well:

“Taking a step back from my day-to-day work to meet with [them] and chatting about career goals at a higher level, gave me more insight into what I personally want from my career path as well.”

The ability to mentor others in their technical acumen and craft is a skill that’s valued at Shopify. As engineers progress in their career here, being an effective mentor becomes a bigger piece of what’s expected in the role. The program gives folks a place to improve both their mentorship and leadership skills through iteration.

Throughout the program, mentors receive tips and resources from engineering leaders at the company that are curated by the program team and shared via Slack, but the most valuable piece ends up being the support they provide each other through a closed channel dedicated to mentors.

Here’s an actual example of how mentors help unblock each other:

Mentor 1: Hey! Im curious to know how y’all are supporting your mentees in coming up with a measurable goal? My mentee’s original goal was “learn how Shopify core works” and we’ve scoped that down to “learn how jobs are run in core” but we still don’t have it being something that’s measurable and can clearly be ticked off by the end of the 6 weeks. They aren’t the most receptive to refining the goal so I’m curious how any of you have approached that as well?

Mentor 2: Hmmm, I’d try to get to the “why” when they say they want to learn how Shopify core works. Is it so they can find things easier? Make better decisions by having more context? Are they interested in why it’s structured the way it is to inform them on future architecture decisions? Maybe that could help in finding something they can measure. Or if they’re just curious, could the goal be something like being able to explain to someone new to Shopify what the different components of the system are and how they interact? Or they’re able to create a new job in core in x amount of time?

Mentor 3: If you've learned how something works, you should be able to tell someone else. So I turn these learning goals into a goal to write a wiki page or report, make a presentation, or teach someone else one on one.

Mentor 1: Thanks for all the replies! I surfaced adapting the learning goal to have an outcome so they've decided on building an example that can be used as documentation and shared with their team. They're writing this example in the component their team maintains as well which will help in "learn how Shopify works" as they weren't currently familiar with the component.

Gathering Program Feedback

At the end of the six weeks mentees and mentors are asked to provide constructive feedback to one another and the program officially comes to a close. 

Program participants receive a feedback survey that helps organizers understand what’s working well and what to revise for future cycles. Participants share

  • Whether they would recommend the program to someone else or not
  • What the best part of the program was for them
  • What they would like to see improved for future cycles

Tweaks are made within a short month or so and the next cycle begins. A new opportunity to connect with someone else, grow your skills, and do it in a time-bound and supportive environment.

What We’ve Learned in 2020

Overall, it’s been working well. The type of feedback we receive from participants is definitely share-worthy:

“was phenomenal to learn more about Production Engineering and our infrastructure. Answered hundreds of mind-numbing questions with extreme patience and detail; he put in so much time to write and share resources—and we wrapped up with a live exercise too! I always look forward to our sessions and it was like walking into a different, fantasy-like universe each time. Hands down the best mentoring experience of my professional career thus far.” - Senior Developer

  • We’ve had 300+ developers participate as mentees and 200+ as mentors.
  • 98% of survey respondents indicated that they would recommend the program to someone else.
  • The demand is there. Each cycle of the program we have to turn away potential mentees because we can’t meet the demand due to limited mentors. We are working on strategies to attract more mentors to better support the program in 2021.

The themes that emerged in terms of where participants found the most value are around:

  • Building relationships: getting to know people is hard. Getting to know colleagues in a fully remote world is near impossible. This program helps.
  • Having dedicated time for learning set aside each week: we’ve all got a lot to do. We know that making time for learning is important, but it can easily fall on the back burner. This program helps with that too.
  • Developing technical craft skills: growing your technical chops? Say no more.
  • Developing skills as a teacher and mentor: getting better at supporting peers can be tricky. You need experience and a safe space to learn with real people.
  • Gaining broader Shopify context: being a T-shaped developer is an asset. By T-shaped we are referring to the vertical bar of the “T” as the areas the developer is an expert in, while the horizontal bar refers to areas where the developer has some breadth and less depth of knowledge. Getting outside of our silos and learning about how different parts of Shopify work helps build stronger developers and a stronger team.

Reinvesting in the developers at Shopify is one way that we help individuals grow in their careers and increase the effectiveness of our teams.

Some great resources that inspired us:

Sarah is a Senior Program Manager who manages Engineering learning programs at Shopify. For the past five years, she has worked on everything from designing and building the initial version of the Dev Degree program to designing and delivering the Engineering Mentorship Program. She is focused on helping our engineers develop their professional skills and creating a strong Engineering community at Shopify.

Steve Lounsbury is a Sr Staff Developer at Shopify. He has been with Shopify since 2013 and is currently working to improve the checkout experience.


How to Make Dashboards Using a Product Thinking Approach

It’s no secret that communicating results to your team is a big part of the data science craft. This is where we drive home the value of our work, allowing our stakeholders to understand, monitor, and ultimately make decisions informed by data. One tool we frequently use at Shopify to help us is the data dashboard. This post is a step-by-step guide to how you can create dashboards that are user-centred and impact-driven.

People use the word dashboard to mean one of several different things. In this post I narrow my definition to mean an automatically updated collection of data visualisations or metrics giving insight into a set of business questions. Popular dashboard-building tools for product analytics include Tableau, Shiny, and Mode.

Unfortunately, unless you’re intentional about your process, it can be easy to put a lot of work into building a dashboard that has no real value. A dashboard that no one ever looks at is about as useful as a chocolate teapot. So, how can you make sure your dashboards meet your users’ needs every time? 

The key is taking a product thinking approach. Product thinking is an integral part of data science at Shopify. Similar to the way we always build products with our merchants in mind, data scientists build dashboards that are impact-focused, and give a great user experience for their audience.

When to Use a Dashboard

Before we dive into how to build a dashboard, the first thing you should ask yourself is whether this is the right tool for your situation. There are many ways to communicate data, including longform documents, presentations, and slidedocs. Dashboards can be time consuming to create and maintain, so we don’t want to put in the effort unnecessarily.

Some questions to ask when deciding whether to build a dashboard are

  • Will the data be dynamically updated?
  • Do you want the exploration to be interactive?
  • Is the goal to monitor something or answer data-related questions?
  • Do users need to continuously refer back to this data as it changes over time?

If most or all of the answers are “Yes”, then a dashboard is a good choice. 

If your goal is to persuade your audience to take a specific action, then a dashboard isn’t the best solution. Dashboards are convenient because they automatically serve updated metrics and visualisations in response to changing data. However, this convenience requires handing off some amount of interpretation to your users. If you want to tell a curated story to influence the audience, you’re better off working with historic, static data in a data report or presentation.

1. Understand the Problem and Your Audience

Once you have decided to build a dashboard, it’s imperative to start with a clear goal and audience in mind—that way you know from the get-go that you’re creating something of value. For example, at Shopify these might be

| Audience | Goal |
| --- | --- |
| Your data team | Decide whether to ship an experimental feature to all our merchants. |
| Senior leadership | Monitor the effect of COVID-19 on merchants with retail stores to help inform our response. |
| Your interdisciplinary product team | Detect changes in user behaviour caused by shipping a new feature. |

If you find that you have more than one stated goal (for example, both monitoring for anomalies and tracking performance), this is a flag that you need more than one dashboard.

Having clearly identified your audience and the reason for your dashboard, you’ll need to figure out what metrics to include that best serve their needs. A lot of the time this isn’t obvious and can turn into a lengthy back and forth with your users, which is ok! Time spent upfront will pay dividends later on. 

Good metrics to include are ones carefully chosen to reflect your stated goals. If your goal is to monitor for anomalies, you need to include a broad range of metrics with tripwire thresholds. If you want a dashboard to tell how successful your product is, you need to think deeply about a small number of KPIs and north stars that are proxies for real value created. 

An example of how you could whiteboard a dashboard plan to show your stakeholders. Visuals are helpful to get everyone aligned.

Once you’ve decided on your metrics and data visualisations, make a rough plan of how they are presented; this could be a spreadsheet, something more visual like a whiteboard sketch, or even post-it notes. Present this to your stakeholders before you write any code: you’ll refine it as you go, but the important thing is to make sure what you’re proposing will help to solve their problem.

Now that you have a plan, you’re ready to start building.

2. Build a Dashboard with Your Users In Mind

The tricky thing about creating a dashboard is that the data presented must be accurate and easy to understand for your audience. So, how do you ensure both of these attributes while you’re building? 

When it comes to accuracy and efficiency, depending on what dashboard software you’re using, you’ll probably have to write some code or queries to create the metrics or visualisations from your data. As with any code we write at Shopify, we follow software best practices:

  • Use clean code conventions such as Common Table Expressions to make queries more readable 
  • Use query optimisations to make queries run as efficiently as possible
  • Use version control to keep track of code changes during the development process 
  • Get the dashboard peer reviewed for quality assurance and to share context
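As a minimal sketch of the first point, the query below uses a Common Table Expression to name an intermediate result before aggregating it. The orders table, its columns, and the sample rows are hypothetical, and SQLite (through Python's built-in sqlite3 module) only stands in for whatever warehouse your dashboard tool queries.

```python
import sqlite3

# Hypothetical data: an in-memory table standing in for a warehouse table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (shop_id INTEGER, order_date TEXT, total REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        (1, "2021-01-04", 20.0),
        (1, "2021-01-05", 35.0),
        (2, "2021-01-04", 10.0),
        (2, "2021-01-11", 50.0),
    ],
)

# The CTE gives the per-shop daily rollup a name, so the final SELECT
# reads as "average daily sales per shop" rather than a nested subquery.
QUERY = """
WITH daily_sales AS (
    SELECT shop_id, order_date, SUM(total) AS sales
    FROM orders
    GROUP BY shop_id, order_date
)
SELECT shop_id, AVG(sales) AS avg_daily_sales
FROM daily_sales
GROUP BY shop_id
ORDER BY shop_id
"""

rows = conn.execute(QUERY).fetchall()
print(rows)
```

Keeping the query text in one named constant also makes it easy to put under version control and hand to a reviewer.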

The way that you present your data will have a huge impact on how easily your users understand and use it. This involves thinking about the layout, the content included or excluded, and the context given.

Use Your Layout to Focus Your Users’ Attention

Like the front page of a newspaper, your users need to know the most important information in the first few seconds.

One way you can do this is to structure your dashboard like an inverted pyramid, with the juicy headlines (key metrics) at the top, important details (analysis and visualisations) in the middle, and more general background info (interesting but less vital analyses) at the bottom. 

Above is an inverted pyramid demonstrating how you can think about the hierarchy of the information you present in your dashboard.

Remember to use the original goals decided on in step one to inform the hierarchy.

Keep the layout logical and simple. Guide the eye of your reader over the page by using a consistent visual hierarchy of headings and sections where related metrics are grouped together to make them easy to find.

Visual hierarchy, grouped sections and whitespace will make your dashboard easier to read.

Similarly, don’t be afraid to add whitespace: it gives your users a breather and, when used properly, increases comprehension of information.

Keep Your Content Sparing But Targeted

The visualisations you choose to display your data can make or break the dashboard. There’s a myriad of resources on this so I won’t go in-depth, but it’s worth becoming familiar with the theory and experimenting with what works best for your situation.

Be brave and remove any visualisations or metrics that aren’t directly relevant to stated goals. Unnecessary details bury the important facts under clutter. If you need to include them, consider creating a separate dashboard for secondary analyses.

Ensure Your Dashboard Includes Business and Data Context

Provide enough business context so that someone discovering your dashboard from another team can understand it at a high level, such as:

  • Why this dashboard exists 
  • Who it’s for
  • When it was built, and if and when it’s set to expire 
  • What features it’s tracking via links to team repositories, project briefs, screenshots, or video walkthroughs

Data context is also important for the metrics on your dashboard as it allows the user to anchor what they are seeing to a baseline. For example, instead of just showing the value for new users this week, add an arrow showing the direction and percentage change since the same time last week.

The statistic on the right has more value than the one on the left because it is given with context.
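One way to produce that kind of contextualised statistic is sketched below. The function name, the arrow glyphs, and the output wording are illustrative assumptions, not part of any dashboard tool.

```python
def week_over_week(current: float, previous: float) -> str:
    """Format a metric with its direction and percentage change versus last week."""
    if previous == 0:
        # Percentage change is undefined against a zero baseline.
        return f"{current:,.0f} (no baseline)"
    change = (current - previous) / previous * 100
    arrow = "▲" if change > 0 else "▼" if change < 0 else "▬"
    return f"{current:,.0f} {arrow} {abs(change):.1f}% vs last week"


# Hypothetical values for "new users this week" and "new users last week".
print(week_over_week(1200, 1000))
print(week_over_week(900, 1000))
```

The zero-baseline branch matters in practice: a brand new metric has no previous week, and a division error on a dashboard tile is worse than a tile that says so.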

You also can provide data context by segmenting or filtering your data. Different segmentations of data can give results with completely opposite interpretations. Leveraging domain-specific knowledge is the key to choosing appropriate segments that are likely to represent the truth.
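To see how segmentation can flip an interpretation, here is a small sketch with made-up conversion counts for two hypothetical checkout variants. Variant A converts better within each device segment, yet variant B looks better when the segments are pooled, an effect known as Simpson's paradox.

```python
# Hypothetical (conversions, visitors) per variant and device segment.
data = {
    "A": {"desktop": (81, 87), "mobile": (192, 263)},
    "B": {"desktop": (234, 270), "mobile": (55, 80)},
}


def rate(conversions: int, visitors: int) -> float:
    """Conversion rate for a single segment."""
    return conversions / visitors


def overall_rate(segments: dict) -> float:
    """Conversion rate after pooling every segment together."""
    conversions = sum(c for c, _ in segments.values())
    visitors = sum(v for _, v in segments.values())
    return conversions / visitors


for variant, segments in data.items():
    per_segment = {name: round(rate(*counts), 3) for name, counts in segments.items()}
    print(variant, per_segment, "overall:", round(overall_rate(segments), 3))
```

The pooled number is dominated by whichever segment contributes the most traffic to each variant, which is why domain knowledge about the segments is needed to decide which view to put on the dashboard.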

Before You Ship, Consider Data Freshness

Your dashboard is only as fresh as the last time it was run, so think about how frequently the data is refreshed. This might be a compromise between how often your users want to look at the dashboard, and how resource-intensive it is to run the queries.

Finally, it’s best practice to get at least two technical reviews before you ship. It’s also worth getting sign-off from your intended users. After all, if they don’t understand or see value in it, they won’t use it.

3. Follow Up and Iterate

You’ve put in a lot of work to understand the problem and audience, and you’ve built a killer dashboard. However, it’s important to remember the dashboard is a tool. The ultimate goal is to make sure it gets used and delivers value. For that you’ll need to do some follow-up.

Market It

It’s up to you to spread awareness and make sure the dashboard gets into the right hands. The way you choose to ‘market’ your dashboard depends on the audience and team culture, but in general, it’s a good idea to think about both how to launch it, and how to make it discoverable longer-term. 

For the launch, think about how you can present what you’ve made in a way that will maximise uptake and understandability. You might only get one chance to do this, so it’s worth being intentional. For example, you might choose to make a well-crafted announcement via a team email and provide an accompanying guide for how to use the dashboard, such as a short walk-through video. 

In the long term, make sure that after launch your dashboard is easily discoverable by whoever might need it. This might mean making it available through an internal data portal and using a title and tags tailored to common search terms. You might also think about ways to re-market the dashboard once time has passed. Don’t be afraid to resurface or make noise about the dashboard when you find organic moments to do so.

Use It and Improve It

Return to the goals identified in step one and think about how to make sure these are reached.

For example, if the dashboard was created to help decide whether to ship a specific feature, set a date for when this should happen and be prepared to give your opinion to the team based on the data at this point.

Monitor usage of the dashboard to find out how often it’s being used, shared, or quoted. It gives insight into how much impact you’ve had with it.

If the dashboard didn’t have the intended outcome, figure out why not. Is there something you could change to make it more useful, or something you would do differently next time? Use this research to help improve the next dashboard.

Maintain It

Finally, as with any data product, without proper maintenance a dashboard will fall into disrepair. Assign a data scientist or team to answer questions or fix any issues that arise.

Key Takeaways

I’ve shown you how to break down the process of building a dashboard using product thinking. A product-thinking approach is the key to maximising the business impact created proportional to the time and effort put in. 

You can take a product-thinking approach to building an impact-driving dashboard by following these steps:

  • Understand your problem and your audience; design a dashboard that does one thing really well, for a clear set of users 
  • Build the dashboard with your users in mind, ensuring it is accurate and easy to understand 
  • Follow up and iterate on your work by marketing, improving and maintaining it into the future. 

By following these three steps, you will create dashboards that put your users front and centre. 

If you’re interested in using data to drive impact, we’re looking for talented data scientists to join our team.

Lin has been a Data Scientist at Shopify for 2 years and is currently working on Merchandising, a team dedicated to helping our merchants be as successful as possible at branding and selling their products. She has a PhD in Molecular Genetics and used to wear a lab coat to work.

Using GraphQL for High-Performing Mobile Applications

GraphQL is a query language that describes the data a client asks for from a server. The client, in this case, is a mobile application. GraphQL is usually compared with REST APIs, the approach most mobile application developers are used to. We will share how GraphQL can solve some of the pain points of REST APIs in mobile application development and discuss tips and best practices that we learned at Shopify by using GraphQL in our mobile applications.

Why Use GraphQL?

A mobile application generally has four basic layers in the codebase:
  1. Network layer: defines the connection and the server to connect to send/receive data.
  2. Data model layer: translates data coming from the network layer to understandable data for local app models.
  3. View models layer: translates data models to understandable models for the user interface.
  4. User interface layer: presents/receives data to/from the user.
Four layers in a mobile application: Network layer, Data model layer, View models layer, User interface layer


A network layer and data model layer are needed for an app to talk to a server and translate that information for the view layers. GraphQL can fit into these two layers, form a data graph layer, and solve most of the pain points mobile developers used to have when using REST APIs.

One of the pain points when using REST APIs is that the data coming from the server has to be mapped many times to different object types in order to be presented on the screen, and vice versa for input on the screen that is sent to the server. Simpler apps might have fewer of these mappings, depending on whether the app has a local database to store data or is online only. But mobile apps always need a mapping that converts the JSON data coming from an API to a class object (for example, Swift objects).

When working with REST endpoints, these mappings basically match statically typed code with unstructured JSON responses. In other words, mobile developers have to hard-code the type of a field and cast the JSON value to the assumed type. Sometimes developers validate and assert the type. These casts or validations might fail, because the server is always changing and deprecating fields and objects. If that happens, we cannot fix the mobile application that is already released in the market without replacing those hard-coded types and assumptions. This is one of the bug-prone parts of a mobile application when working with REST endpoints. These changes will happen again and again during the lifetime of an application. The mobile developer’s job is to maintain the hard-coded types and keep parity between the API’s responses and the application’s mapping logic. Any change to the server APIs has to be announced, and that forces mobile developers to update their code.

The problem described above can be somewhat alleviated by adding frameworks to control the flow and by providing more API documentation, such as the OpenAPI Specification (OAS). However, this does not solve the problem at the endpoint itself; it adds a workaround and dependencies on other frameworks.

On the other hand, GraphQL addresses these concerns. GraphQL APIs are strongly typed and act as a self-documenting contract between server and clients. Strongly typed means each type of data is predefined as part of the language. This makes it easy for clients to always stay in sync with server data types. There are no more hand-written static types in your mobile application and no JSON mapping to static data types in the codebase. Mobile apps’ objects will always be synced with the server objects, and developers will get updates and deprecations at compile time.

GraphQL endpoints are defined by schemas. Introspection is the system in GraphQL that enables tooling to generate code for different languages and platforms. Deprecation is a good example of what introspection exposes: each field has an isDeprecated boolean and a deprecationReason. This becomes very useful because tooling can show warnings and feedback at compile time in a mobile project.

As an example, the below JSON is the response from an API endpoint:

The price field on product is received as a String type and the client has the mapping below to convert the JSON data into a Swift model:

let price = product["price"] as? String

This type casting is how a mobile application translates the JSON data into data the UI layers can understand. Basically, mobile clients have to statically define the type of each field, and this definition is independent of the server’s objects.
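The fragility of that hard-coded cast can be sketched outside Swift as well. In the Python sketch below, read_price mimics the conditional cast above, and both JSON payloads are hypothetical: the second one imagines the server changing price from a string to a number.

```python
import json
from typing import Optional


def read_price(payload: str) -> Optional[str]:
    """Mimic a conditional cast to String: return the price only if it is a string."""
    product = json.loads(payload)["product"]
    price = product.get("price")
    return price if isinstance(price, str) else None


# Hypothetical responses before and after a server-side type change.
old_response = '{"product": {"title": "T-shirt", "price": "19.99"}}'
new_response = '{"product": {"title": "T-shirt", "price": 19.99}}'

print(read_price(old_response))  # the client's assumption holds
print(read_price(new_response))  # the cast silently fails and the price is lost
```

Nothing in the client code signals the mismatch until the app runs against the changed response, which is the gap a typed schema is meant to close.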

On the other hand, GraphQL removes these static type castings. Client and server are always tightly coupled and in sync. In the example above, the Product type is in the schema in the GraphQL documentation as a contract between client and server, and price is always the type that is defined in that contract. So the client no longer keeps static types for each field.

Trade-offs

Note that customization comes with a cost. It is the client developer's responsibility to keep performance high while taking advantage of the customization. The choice between a REST API and GraphQL is up to the developer and depends on the project, but in general a REST API endpoint is defined in a more optimized way: each endpoint only receives a defined input and returns a defined output, and no more than that. GraphQL endpoints can be customized, and clients can ask for anything in a single request. But clients also need to be careful about the cost of this flexibility. We talk about GraphQL query costs later, but having a cost doesn't mean we can't reach the same optimization with GraphQL as with a REST API. Query cost should be considered when taking advantage of the customization feature.

Tips and Best Practices

To use GraphQL queries in a mobile project, you need a code generator tool to generate the client-side files representing your GraphQL queries, mutations, and responses. The tool we use at Shopify is called Syrup. Syrup is open source and generates strongly typed Swift and Kotlin code based on the GraphQL queries used in your app. Let's look at some examples of GraphQL queries in a mobile application and learn some tips. The examples and screenshots are from the Shopify POS application.

Fragments and Screens in Mobile Apps

Defining fragments usually depends on the application UI. In this example, the order details screen in the Shopify POS application shows lineItems on an order, but it also has a sub screen that shows an event on the order with related lineItems. The top image shows order details, and the bottom image shows the return event screen with the lineItems that are returned.


Fragments: return event screen with the lineItems that are returned on the bottom


In this example, the lineItem rows in both screens are exactly the same, and the view that creates each row receives exactly the same information. Assuming each screen calls its own query to get the information it needs, both queries need the same fields on the lineItem object. So the OrderLineItem object is shared between more than one screen and between more than one query in the app. With GraphQL, we define OrderLineItem as a fragment so it's reusable, which guarantees that the lineItem view gets all the fields it needs every time the app fetches a lineItem using this fragment. See query examples below:
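As a minimal sketch of the idea, the two screen queries below spread one shared fragment. The schema is hypothetical: the type name LineItem, the returnEvent field, and every field inside the fragment are assumptions made for illustration, and the GraphQL documents are held in Python strings only so the sketch can be run.

```python
# Hypothetical fragment: the fields the line item row view needs.
ORDER_LINE_ITEM_FRAGMENT = """
fragment OrderLineItem on LineItem {
  id
  title
  quantity
  price
}
"""

# Two hypothetical screen queries that spread the same fragment.
ORDER_DETAILS_QUERY = """
query OrderDetails($id: ID!) {
  order(id: $id) {
    name
    lineItems { ...OrderLineItem }
  }
}
"""

RETURN_EVENT_QUERY = """
query ReturnEvent($id: ID!) {
  returnEvent(id: $id) {
    createdAt
    lineItems { ...OrderLineItem }
  }
}
"""


def build_document(query: str, *fragments: str) -> str:
    """Append fragment definitions to a query so it can be sent as one document."""
    return "\n".join([query.strip(), *(f.strip() for f in fragments)])


order_details = build_document(ORDER_DETAILS_QUERY, ORDER_LINE_ITEM_FRAGMENT)
return_event = build_document(RETURN_EVENT_QUERY, ORDER_LINE_ITEM_FRAGMENT)
print(order_details)
```

If the row view later needs another field, it is added once in the fragment and both screens receive it.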

Fragments with Common Fields but Different Names

Fragments can be customized on the client side, and in mobile applications how you define them usually depends very much on the UI. Defining more fragments does not affect query cost, so it's free and it gives your query a good structure. A good tip about fragments is that not only can you break down the fields into multiple fragments, you can also put the same fields in multiple fragments, and again it does not add cost to the query. For example, sometimes applications present the same data on more than one screen. In our OrderDetails screen example, the POS app presents high-level payment information about the order in the OrderDetails screen (such as subtotal, discount, and total), but an order can have a longer payment history (including change, failed transactions, and so on). Order history is presented in sub screens if the user chooses to see that information. Assuming we only call one query to get all the information, we can have two fragments: OrderPayments and OrderHistory.

See fragments below:
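A sketch of what two such fragments could look like follows. All field names (subtotal, discount, total, transactions, and so on) are hypothetical, and the small helper exists only to show which top-level fields the two fragments have in common.

```python
# Hypothetical fragments on the same Order type with overlapping fields.
ORDER_PAYMENTS_FRAGMENT = """
fragment OrderPayments on Order {
  id
  subtotal
  discount
  total
}
"""

ORDER_HISTORY_FRAGMENT = """
fragment OrderHistory on Order {
  id
  total
  transactions { kind status amount }
}
"""

# One query spreads both fragments, so each screen gets its own typed slice.
ORDER_QUERY = """
query Order($id: ID!) {
  order(id: $id) {
    ...OrderPayments
    ...OrderHistory
  }
}
"""


def top_level_fields(fragment: str) -> set:
    """Return the first identifier on each line of the fragment's top-level selection."""
    fields, depth = set(), 0
    for line in fragment.strip().splitlines():
        stripped = line.strip()
        if depth == 1 and stripped and stripped != "}":
            fields.add(stripped.split()[0])
        depth += stripped.count("{") - stripped.count("}")
    return fields


shared = top_level_fields(ORDER_PAYMENTS_FRAGMENT) & top_level_fields(ORDER_HISTORY_FRAGMENT)
print(sorted(shared))
```

The repeated fields appear in both fragments on purpose: each screen can be handed exactly one fragment's data without reaching into the other.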

Defining these fragments makes it easier to pass the data around and it does not affect the performance or cost of query. We are going to talk more about query cost later.

Customize Query Response to Benefit your App’s UX

With GraphQL you are able to customize your query or mutation response for the benefit of your application UI. If you have used a REST API for a mobile application before, you will appreciate the power that GraphQL can bring to your app. For example, after calling a mutation on an Order object, you can define the response of the mutation call with the fields you need to build your next screen. If the mutation adds a lineItem to an order object and your next screen shows the total price of the order, you can define the response object to include the totalPrice field on order so you can easily build your UI without having to fetch the updated order object. See mutation example below:
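A minimal sketch of such a mutation is shown here. The mutation name orderAddLineItem, its arguments, and the sample response are hypothetical; only the idea of selecting totalPrice in the mutation's response comes from the text above.

```python
# Hypothetical mutation whose selection set asks for exactly what the next screen needs.
ADD_LINE_ITEM_MUTATION = """
mutation AddLineItem($orderId: ID!, $variantId: ID!, $quantity: Int!) {
  orderAddLineItem(orderId: $orderId, variantId: $variantId, quantity: $quantity) {
    order {
      id
      totalPrice
    }
  }
}
"""


def build_request(order_id: str, variant_id: str, quantity: int) -> dict:
    """Build a JSON request body: the GraphQL document plus its variables."""
    return {
        "query": ADD_LINE_ITEM_MUTATION,
        "variables": {"orderId": order_id, "variantId": variant_id, "quantity": quantity},
    }


def total_price_from(response: dict) -> str:
    """Read the updated total from the mutation response, with no follow-up query."""
    return response["data"]["orderAddLineItem"]["order"]["totalPrice"]


# Hypothetical response shaped like the selection set above.
sample_response = {"data": {"orderAddLineItem": {"order": {"id": "1", "totalPrice": "42.00"}}}}
print(total_price_from(sample_response))
```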

This flexibility is not possible with REST API without asking the server team to change the response object for the specific REST API endpoint.

Use Aliases to Have Readable Data Models Based on your App’s UI

If you are building the UI directly from the GraphQL objects, you can use aliases to rename fields to anything you want. A small tip: besides simply renaming a field, you can alias it and then add an extension to the object that exposes the original field’s name as a new variable with added logic. See example below:
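The sketch below illustrates that tip under stated assumptions: the alias rawTotalPrice, the Order class, and the currency formatting are all hypothetical. A Python property stands in for the extension a Swift or Kotlin client would add to the generated type.

```python
from dataclasses import dataclass

# The alias renames the schema field `totalPrice` in the response to `rawTotalPrice`.
ORDER_QUERY = """
query OrderSummary($id: ID!) {
  order(id: $id) {
    rawTotalPrice: totalPrice
  }
}
"""


@dataclass
class Order:
    raw_total_price: str  # populated from the aliased `rawTotalPrice` key

    @property
    def total_price(self) -> str:
        """The original field name, reused for a display value with added logic."""
        return f"${float(self.raw_total_price):,.2f}"


# Hypothetical response: the key is the alias, not the schema field name.
response = {"data": {"order": {"rawTotalPrice": "1234.5"}}}
order = Order(raw_total_price=response["data"]["order"]["rawTotalPrice"])
print(order.total_price)
```

Views keep reading total_price, while the raw server value stays available under the aliased name.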


Use Directives and Add Logic to Your Query

Directives are mentioned in the GraphQL documentation as a way to avoid string manipulation for server side code, but they also have advantages for a mobile client. For example, for the Order details screen, POS needs different fields on order based on the type of the order. If the order is a pickup order, the OrderDetails screen needs more information about fulfillments and does not need information about shipping details. With directives you can pass boolean variables from your UI to the query to include or skip fields. See the query example below:
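A sketch of such a query, using GraphQL's built-in @include and @skip directives, might look like this. The fields fulfillments and shippingDetails, their sub-fields, and the $isPickup variable are hypothetical names chosen to match the description above.

```python
# One query serves both order types; a boolean variable decides which fields are fetched.
ORDER_DETAILS_QUERY = """
query OrderDetails($id: ID!, $isPickup: Boolean!) {
  order(id: $id) {
    id
    fulfillments @include(if: $isPickup) { status }
    shippingDetails @skip(if: $isPickup) { address }
  }
}
"""


def variables_for(order_id: str, order_type: str) -> dict:
    """Translate what the UI knows about the order into the query's variables."""
    return {"id": order_id, "isPickup": order_type == "pickup"}


print(variables_for("1", "pickup"))
print(variables_for("2", "shipping"))
```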

We can add directives on fragments or fields. This enables mobile applications to fetch only the data that the UI needs and not more than that. This flexibility isn’t possible with REST API endpoints without having two different endpoints and having code in the mobile app codebase to switch between endpoints based on the boolean variable.

GraphQL Performance

GraphQL gives all the power and simplicity to your mobile application and some work is now transferred to the server-side to give clients the flexibility. On the client side, we have to consider the costs of a query we build. The cost of the query affects performance directly as it affects the responsiveness of your application and the resources on the server. This is not something that is usually mentioned when talking about GraphQL, but at Shopify we care about performance on both client-side and server-side.

Different GraphQL servers might have different API rate limiting methods. At Shopify, calls to GraphQL APIs are limited based on calculated query cost, which means the total cost of queries per minute matters more than the number of query calls per minute. Each field in the schema has an integer cost value, and the sum of all these costs is the cost of the query we build on the client side.

In simple words, each user has a bucket with a maximum query cost per minute, and the bucket is refilled every second. Obviously, complex queries take up a proportionally larger amount of that bucket. To start executing a query, the bucket should have enough room for the complexity of the requested query. That is why we should care about our calculated query cost on the client side. There are tips and ways to improve the query cost in general, as described here.
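The bucket idea can be sketched in a few lines. This is a simplified model, not Shopify's implementation: the per-field costs, the capacity of 1000, and the restore rate of 50 points per second are illustrative numbers only.

```python
# Hypothetical per-field costs; a query's cost is the sum over the fields it selects.
FIELD_COSTS = {"order": 1, "lineItems": 2, "totalPrice": 0}


def query_cost(fields) -> int:
    """Sum the integer cost of every field selected by a query."""
    return sum(FIELD_COSTS[name] for name in fields)


class CostBucket:
    """Sketch of cost-based rate limiting: queries spend points, time restores them."""

    def __init__(self, capacity: int, restore_per_second: int):
        self.capacity = capacity
        self.restore_per_second = restore_per_second
        self.available = capacity

    def tick(self, seconds: int = 1) -> None:
        """Restore points for elapsed time, never exceeding capacity."""
        self.available = min(self.capacity, self.available + seconds * self.restore_per_second)

    def try_run(self, cost: int) -> bool:
        """Run the query only if the bucket has room for its calculated cost."""
        if cost > self.available:
            return False
        self.available -= cost
        return True


bucket = CostBucket(capacity=1000, restore_per_second=50)  # illustrative numbers
print(bucket.try_run(700))  # fits in a full bucket
print(bucket.try_run(400))  # rejected: only 300 points remain
bucket.tick(2)              # two seconds restore 100 points
print(bucket.try_run(400))  # now fits
```

In this model a client that trims unused fields from a query lowers its cost and can run more queries before the bucket empties.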

Future of GraphQL

GraphQL is more than just a graph query language. It’s language-independent and flexible enough to serve any platform’s needs. It is built to serve clients where network bandwidth, latency, and UX are critical. We mentioned the pain points when using REST in mobile applications and how GraphQL can address many of those concerns. GraphQL allows you to build whatever you need for the client and fulfill it in your own way. GraphQL is already an immense move forward from REST API design, directly addressing the models of data that need to be transferred between each client and server to do the job. At Shopify, we believe in the future of GraphQL, and that is why Shopify has offered APIs in GraphQL since 2018.

Mary is a senior developer in Retail at Shopify. She has tons of experience in Swift and iOS development in general, and has been coding Swift since 2014. She's contributed to a variety of apps and since joining Shopify she's been on the Point Of Sale (POS) app team. She recently switched to React Native and started learning JavaScript and React. If you want to connect with Mary, check her out on Twitter.

Apache Beam for Search: Getting Started by Hacking Time

To create relevant search, processing clickstream data is key: you frequently want to promote search results that are being clicked on and purchased, and demote those things users don’t love.

Typically search systems think of processing clickstream data as a batch job run over historical data, perhaps using a system like Spark. But on Shopify’s Discovery team, we ask the question: What if we could auto-tune relevance in real-time as users interact with search results—not having to wait days for a large batch job to run?

At Shopify—this is what we’re doing! We’re using streaming data processing systems that can process both real-time and historic data to enable real-time use cases ranging from simple auto boosting or down boosting of documents, to computing aggregate click popularity statistics, building offline search evaluation sets, and on to more complex reinforcement learning tasks.

But this article introduces you to the streaming systems themselves, and in particular to Apache Beam. The most important thing to think about with those streaming systems is time. So let’s get started!

What Exactly is Apache Beam?

Apache Beam is a unified batch and stream processing system. This lets us potentially unify historic and real-time views of user search behaviors in one system. Instead of a batch system, like Spark, to churn over months of old data, and a separate streaming system, like Apache Storm, to process the live user traffic, Beam hopes to keep these workflows together.

For search, this is rather exciting. It means we can build search systems that both rely on historic search logs while perhaps being able to live-tune the system for our users’ needs in various ways.

Let’s walk through an early challenge everyone faces with Beam: that of time! Beam is a kind of time machine that has to reorder events into their right spot after they get annoyingly delayed by lots of intermediate processing and storage steps. This is one of the core complications of a streaming system: how long do we wait? How do we deal with late or out-of-order data?

So to get started with Beam, the first thing you’ll need to do is Hack Time!

The Beam Time Problem

At the core of Apache Beam are pipelines. They connect a source through various processing steps to finally a sink.  

Data flowing through a pipeline is timestamped. When you consider a streaming system, this makes sense. We have various delays as events flow from browsers, through APIs, and other data systems. Finally the events arrive at our Beam pipeline. They can easily be out of order or delayed. Beam source APIs, like the one for Kafka, maintain a moving view of the event data, known as a watermark, to emit well-ordered events.

If we don’t give our Beam source good information on how to build a timestamp, we’ll drop events or receive them in the wrong order. But even more importantly for search, we likely must combine different streams of data to build a single view on a search session or query.

Joining (a Beam topic for another day!) needs to look back over each source’s watermark and ensure they’re aligned in time before deciding that sufficient time has elapsed before moving on. But before you get to the complexities of streaming joins, replaying with accurate timestamps is the first milestone on your Beam-for-clickstream journey.

Configuring the Timestamp Right at the Source

Let’s set up a simple Beam pipeline to explore Beam. Here we’ll use Kafka in Java as an example. You can see the full source code in this gist.

Here we’ll set up a Kafka source, the start of a pipeline producing a custom SearchQueryEvent stored in a search_queries_topic.

You’ll notice we have information on the topic/servers to retrieve the data, along with how to deserialize the underlying binary data. We might add further processing steps to transform or process our SearchQueryEvents, eventually sending the final output to another system.

But nothing about time yet. By default, the produced SearchQueryEvents will use Kafka processing time. That is, when they’re read from Kafka. This is the least interesting for our purposes. We care about when users actually searched and clicked on results.

More interesting is when the event was created in a Kafka client, which we can add here:

.withCreateTime(Duration.standardMinutes(5))

You’ll notice that when we use create time, we need to give the source’s watermark a tip for how out of order event times might be. In the snippet above, we instruct the Kafka source to use create time, but with a possible 5 minutes of discrepancy.

Appreciating The Beam Time Machine

Let’s reflect on what such a 5 minute possible delay actually means from the last snippet. Beam is kind of a time machine… How Beam bends space-time is where your mind can begin to hurt.

As you might be picking up, event time is quite different from processing time! So in the code snippet above, we’re *not* telling the computer to wait for 5 minutes of execution time for more data. No, the event time might be replayed from historical data, where 5 minutes of event time is replayed through our pipeline in mere milliseconds. Or it could be that event time really is now, and we’re actively streaming live data for processing. In that case we DO indeed wait 5 real minutes!

Let’s take a step back and use a silly example to understand this. It’s really crucial to your Beam journey. 

Imagine we’re super-robot androids that can watch a movie at 1000X speed. Maybe like Star Trek The Next Generation’s Lt Commander Data. If you’re unfamiliar, he could process input as fast as a screen could display! Data might say “Hey look, I want to watch the classic 80s movie, The Goonies, so I can be a cultural reference for the crew of the Enterprise.” 

Beam is like watching a movie in super-fast forward mode with chunks of the video appearing possibly delayed or out of order relative to other chunks in movie time. In this context we have two senses of time:

  • Event Time: the timestamp in the actual 1h 55 minute runtime of The Goonies aka movie time.
  • Processing Time: the time we actually experience The Goonies (perhaps just a few minutes if we’re super-robot androids like Data).

So Data tells the Enterprise computer “Look, play me The Goonies as fast as you can recall it from your memory banks.” And the computer has various hiccups where certain frames of the movie aren’t quite getting to Data’s screen to keep the movie in order. 

Commander Data can tolerate missing these frames. So Data says “Look, don’t wait more than 5 minutes in *movie time* (aka event time) before just showing me what you have so far of that part of the movie.” This lets Data watch the full movie in a short amount of time, dropping a tolerable number of movie frames.

This is just what Beam is doing with our search query data. Sometimes it’s replaying days worth of historic search data in milliseconds, and other times we’re streaming live data where we truly must wait 5 minutes for reality to be processed. Of course, the right delay might not be 5 minutes, it might be something else appropriate to our needs. 

Beam has other primitives such as windows which further inform, beyond the source, how data should be buffered or collected in units of time. Should we collect our search data in daily windows? Should we tolerate late data? What does subsequent processing expect to work over? Windows also work with the same time machine concepts that must be appreciated deeply to work with Beam.
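To give a feel for how windows depend on event time rather than processing time, here is a small stdlib-only sketch of assigning an event to a fixed-size window. It’s a simplified model of the idea and not Beam’s windowing API; the class and method names are made up for this example.

```java
import java.time.Duration;
import java.time.Instant;

// Simplified model of fixed windows: each event belongs to the window containing its event time.
final class FixedWindowSketch {
    // Returns the start of the fixed-size window containing eventTime, aligned to the Unix epoch.
    static Instant windowStart(Instant eventTime, Duration size) {
        long sizeMillis = size.toMillis();
        long ts = eventTime.toEpochMilli();
        // floorMod keeps the alignment correct for timestamps before the epoch too.
        return Instant.ofEpochMilli(ts - Math.floorMod(ts, sizeMillis));
    }
}
```

A search issued at 13:45 UTC lands in the daily window starting at midnight UTC of that day, whether the pipeline processes it a second later or replays it a year later.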

Incorporating A Timestamp Policy

Beam might know a little about Kafka, but it really doesn’t know anything about our data model. Sometimes we need even more control over the definition of time in the Beam time machine.

For example, in our previous movie example, movie frames perhaps have some field informing us of how they should be arranged in movie time. If we examine our SearchQueryEvent, we also see a specific timestamp embedded in the data itself:

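public class SearchQueryEvent {
   // The raw query text the user typed
   public final String queryString;
   // Event time: when the user actually searched
   public final Instant searchTimestamp;

   public SearchQueryEvent(String queryString, Instant searchTimestamp) {
      this.queryString = queryString;
      this.searchTimestamp = searchTimestamp;
   }
}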

Well, Beam sources can often be configured to use a custom event time like our searchTimestamp. We just need to make a TimestampPolicy. We provide a simple function class that takes in our record (a key-value of Long->SearchQueryEvent) and returns a timestamp. We can use this to create our own timestamp policy, passing in our own function and giving the same allowed delay (5 minutes). This is all wrapped up in a TimestampPolicyFactory class, SearchQueryTimestampPolicyFactory (now if that doesn’t sound like a Java class name, I don’t know what does ;) )
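To show the moving parts without pulling in Beam, here is a stdlib-only model of a “limited delay” timestamp policy: a function extracts the event time from each record, and the watermark trails the largest event time seen by the allowed delay. This is a sketch under assumptions and not Beam’s API. The class name is hypothetical, it uses `java.time` types to stay self-contained, and real policies also deal with idle partitions and restarts.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.function.Function;

// Stdlib-only model of a limited-delay timestamp policy; this is not Beam's API.
final class LimitedDelayPolicySketch<T> {
    private final Function<T, Instant> timestampFn;
    private final Duration maxDelay;
    private Instant maxEventTime = Instant.EPOCH;

    LimitedDelayPolicySketch(Function<T, Instant> timestampFn, Duration maxDelay) {
        this.timestampFn = timestampFn;
        this.maxDelay = maxDelay;
    }

    // Extracts the record's event time and advances the largest event time seen so far.
    Instant timestampFor(T record) {
        Instant ts = timestampFn.apply(record);
        if (ts.isAfter(maxEventTime)) {
            maxEventTime = ts;
        }
        return ts;
    }

    // Watermark: we assume no record with an earlier event time is still on its way.
    Instant watermark() {
        return maxEventTime.minus(maxDelay);
    }

    // A record is late when its event time falls behind the current watermark.
    boolean isLate(T record) {
        return timestampFn.apply(record).isBefore(watermark());
    }
}
```

For our events the function would be something like `event -> event.searchTimestamp`. Notice that nothing here reads the wall clock: the watermark only moves when event times move, which is why a historical replay can cover 5 minutes of event time in milliseconds.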

We can add our timestamp policy to the builder:

.withTimestampPolicyFactory(new SearchQueryTimestampPolicyFactory())

Hacking Time!

Beam is about hacking time, and I hope you’ve appreciated this walkthrough of some of Beam’s capabilities. If you’re interested in joining me on building Shopify’s future in search and discovery, please check out these great job postings!

Doug Turnbull is a Sr. Staff Engineer in Search Relevance at Shopify. He is known for writing the book “Relevant Search”, contributing to “AI Powered Search”, and creating relevance tooling for Solr and Elasticsearch like Splainer, Quepid, and the Elasticsearch Learning to Rank plugin. Doug’s team at Shopify helps Merchants make their products and brands more discoverable. If you’d like to work with Doug, send him a Tweet at @softwaredoug!

How Shopify Uses WebAssembly Outside of the Browser

On February 24, 2021, Shipit!, our monthly event series, presented Making Commerce Extensible with WebAssembly. The video is now available.

At Shopify we aim to make what most merchants need easy, and the rest possible. We make the rest possible by exposing interfaces to query, extend and alter our Platform. These interfaces empower a rich ecosystem of Partners to solve a variety of problems. The primary mechanism of this ecosystem is an “App”, an independently hosted web service which communicates with Shopify over the network. This model is powerful, but comes with a host of technical issues. Partners are stretched beyond their available resources as they have to build a web service that can operate at Shopify’s scale. Even if Partners’ resources were unlimited, the network latency incurred when communicating with Shopify precludes the use of Apps for time sensitive use cases.

We want Partners to focus on using their domain knowledge to solve problems, and not on managing scalable web services. To make this a reality we’re keeping the flexibility of untrusted Partner code, but executing it on our own infrastructure. We chose a universal format for that code that ensures it’s performant, secure, and flexible: WebAssembly.

WebAssembly

What is WebAssembly? According to WebAssembly.org:

“WebAssembly (abbreviated Wasm) is a binary instruction format for a stack-based virtual machine. Wasm is designed as a portable compilation target for programming languages, enabling deployment on the web for client and server applications.”

To learn more, see this series of illustrated articles written by Lin Clark of Mozilla with information on WebAssembly and its history.

Wasm is often presented as a performant language that runs alongside JavaScript from within the browser. We, however, execute Wasm outside of the browser and with no JavaScript involved. Wasm, far from being solely a JavaScript replacement, is designed for Web and Non-Web Embeddings alike. It solves the more general problem of performant execution in untrusted environments, which exists in browsers and code execution engines alike. Wasm satisfies our three main technical requirements: security, performance, and flexibility.

Security

Executing untrusted code is a dangerous thing—it's exceptionally difficult to predict by nature, and it has potential to cause harm to Shopify’s platform at large. While no application is entirely secure, we need to both prevent security flaws and mitigate their impacts when they occur.

Wasm executes within a sandboxed stack-based environment, relying upon explicit imports to allow communication with the host. Because of this, you cannot express anything malicious in Wasm. You can only express manipulations of the virtual environment and use provided imports. This differs from bytecodes which have references to the computers or operating systems they expect to run on built right into the syntax.

Wasm also hosts a number of features which protect the user from buggy code, including protected call stacks and runtime type checking. More details on the security model of Wasm can be found on WebAssembly.org.

Performance

In ecommerce, speed is a competitive advantage that merchants need to drive sales. If a feature we deliver to merchants doesn’t come with the right tradeoff of load times to customization value, then we may as well not deliver it at all.

Wasm is designed to leverage common hardware capabilities that provide it near native performance on a wide variety of platforms. It’s used by a community of performance driven developers looking to optimize browser execution. As a result, Wasm and surrounding tooling was built, and continues to be built, with a performance focus.

Flexibility

A code execution service is only as useful as the developers using it are productive. This means providing first class development experiences in multiple languages they’re familiar with. As a bytecode format, Wasm is targeted by a number of different compilers. This allows us to support multiple languages for developer use without altering the underlying execution model.

Community Driven

We have a fundamental alignment in goals and design, which provides our “engineering reason” for using Wasm. But there’s more to it than that—it’s about the people as well as the technology. If nobody was working on the Wasm ecosystem, or even if it was just on life support in its current state, we wouldn’t use it. WebAssembly is an energized community that’s constantly building new things and has a lot of potential left to reach. By becoming a part of that community, Shopify stands to gain significantly from that enthusiasm.

We’re also contributing to that enthusiasm ourselves. We’re collecting user feedback, discussing feature gaps, and most importantly contributing to the open source tools we depend on. We think this is the start of a healthy reciprocal relationship between ourselves and the WebAssembly community, and we expect to expand these efforts in the future.

Architecture of our Code Execution Service

Now that we’ve covered WebAssembly and why we’re using it, let’s move onto how we’re executing it.

We use an open source tool called Lucet (originally written by Fastly). As a company, Fastly provides a programmable edge cloud platform. They’re trying to bring execution of high-volume, short-lived, and untrusted modules closer to where they’re being requested. This is the same as the problem we’re trying to solve with our Partner code, so it’s a natural fit to be using the same tools.

Lucet

Lucet is both a runtime and a compiler for Wasm. Modules are represented in Wasm for the safety that representation provides. Recall that you can’t express anything malicious in Wasm. Lucet takes advantage of this and uses a validation of the Wasm module as a security check. After the validation, the module is compiled to an executable artifact with near bare metal performance. It also supports ahead of time compilation, allowing us to have these artifacts ready to execute at runtime. Lucet containers boast an impressive startup time of 35 μs. That’s because it’s a container that doesn’t need to do anything at all to start up. If you want the full picture, Tyler McMullan, the CTO of Fastly, did a great talk which gives an overview of Lucet and how it works.

Figure: Shopify's Wasm engine, a flow diagram showing Lucet wrapped within a Rust web service which manages the I/O and storage of modules.

We wrap Lucet within a Rust web service which manages the I/O and storage of modules, which we call the Wasm Engine. This engine is called by Shopify during a runtime process, usually a web request, in order to satisfy some function. It then applies the output in context of the callsite. This application could involve the creation of a discount, the enforcement of a constraint, or any form of synchronous behaviour Merchants want to customize within the Platform.

Execution Performance

Here are some metrics pulled from a recent performance test. During this test, 100k modules were executed per minute for approximately 5 min. These modules contained a trivial implementation of enforcing a limit on the number of items purchased in a cart.

Figure: Time taken to execute a module (line graph; x-axis: the time over which the test was running, y-axis: time in ms).

This chart demonstrates a breakdown of the time taken to execute a module, including I/O with the container and the execution of the module. The y-axis is time in ms, the x-axis is the time over which the test was running.

The light purple bar shows the time taken to execute the module in Lucet, the width of which hovers around 100 μs. The remaining bars deal with I/O and engine specifics, and the total time of execution is around 4 ms. All times are 99th percentiles (p99).

To put these times in perspective, let’s compare them to the request times of Storefront Renderer, our performant Online Store rendering service:

Figure: Storefront Renderer response time (line graph).

This chart demonstrates the request time to Storefront Renderer over time. The y-axis is request time in seconds. The x-axis is the time over which the values were retrieved. The light blue line representing the 99th percentile hovers around 700 ms.

Given that our module execution process generally takes under 5 ms, which is less than 1% of a 700 ms p99 request to Storefront Renderer, we can say that the performance impact of Lucet execution is negligible.

Generating WebAssembly

To get value out of our high performance execution engine, we’ll need to empower developers to create compatible Wasm modules. Wasm is primarily intended as a compilation target, rather than something you write by hand (though you can write Wasm by hand). This leaves us with the question of what languages we’ll support and to what extent.

Theoretically any language with a Wasm target can be supported, but the effort developers spend to conform to our API is better focused on solving problems for merchants. That’s why we’ve chosen to provide first class support to a single language that includes tools that get developers up and running quickly.

At Shopify, our language of choice is Ruby. However, because Ruby is a dynamic language, we can’t compile it down to Wasm directly. We explored solutions involving compiling interpreters, but found that there was a steep performance penalty. Because of this, we decided to go with a statically compiled language and revisit the possibility of dynamic languages in the future.

Through our research we found that developers in our ecosystem were most familiar with JavaScript. Unfortunately, JavaScript was precluded as it’s a dynamic language like Ruby. Instead, we chose a language with familiar TypeScript-like syntax called AssemblyScript.

Using AssemblyScript

At first glance, there are a huge number of languages that support a WebAssembly target. Unfortunately, there are two broad categories of WebAssembly compilers which we can’t use:

  • Compilers that generate environment or language specific artifacts, namely node or the browser. (Examples: Asterius, Blazor)
  • Compilers that are designed to work only with a particular Runtime. The modules generated by these compilers rely upon special language specific imports. This is often done to support a language’s standard library, which expects certain system calls or runtime features to be available. Since we don’t want to be locked down to a certain language or tool, we don’t use these compilers. (Examples: Lumen)

These are powerful tools in the right conditions, but aren’t built for our use case. We need tools that produce WebAssembly, rather than tools which are powered by WebAssembly. AssemblyScript is one such tool.

AssemblyScript, like many tools in the WebAssembly space, is still under development. It’s missing a few key features, such as closure support, and it still has a number of edge case bugs. This is where the importance of the community comes in.

The language and the tooling around AssemblyScript has an active community of enthusiasts and maintainers who have supported Shopify since we first started using the language in 2019. We’ve supported the community through an OpenCollective donation and continuing code contributions. We’ve written a language server, made some progress towards implementing closures, and have written bug fixes for the compiler and surrounding tooling.

We’ve also integrated AssemblyScript into our own early stage tooling. We’ve built integrations into the Shopify CLI which will allow developers to create, test, and deploy modules from their command line. To improve developer ergonomics, we provide SDKs which handle the low level implementation concerns of Shopify defined objects like “Money”. In addition to these tools, we’re building out systems which allow Partners to monitor their modules and receive alerts when their modules fail. The end goal is to give Partners the ability to move their code onto our service without losing any of the flexibility or observability they had on their own platform.

New Capabilities, New Possibilities

As we tear down the boundaries between Partners and Merchants, we connect merchants with the entrepreneurs ready to solve their problems. If you have ideas on how our code execution could help you and the Apps you own or use, please tweet us at @ShopifyEng. To learn more about Apps at Shopify and how to get started, visit our developer page.

Duncan is a Senior Developer at Shopify. He is currently working on the Scripts team, a team dedicated to enabling and managing untrusted code execution within Shopify for Merchants and Developers alike.

If you love working with open source tools, are passionate about API design and extensibility, and want to work remotely, we’re always hiring! Reach out to us or apply on our careers page.

Simplify, Batch, and Cache: How We Optimized Server-side Storefront Rendering

On December 16, 2020 we held Shipit! presents: Performance Tips from the Storefront Renderer Team. A video for the event is now available for you to learn more about how the team optimized this Ruby application for the particular use case of serving storefront traffic. Click here to watch the video.

By Celso Dantas and Maxime Vaillancourt

In the previous post about our new storefront rendering engine, we described how we went about the rewrite process and smoothly transitioned to serve storefront requests with the new implementation. As a follow-up and based on readers’ comments and questions, this post dives deeper into the technical details of how we built the new storefront rendering engine to be faster than the previous implementation.

To set the table, let’s see how the new storefront rendering engine performs:

  • It generates a response in less than ~45ms for 75% of storefront requests;
  • It generates a response in less than ~230ms for 90% of storefront requests;
  • It generates a response in less than ~900ms for 99% of storefront requests.

Thanks to the new storefront rendering engine, the average storefront response is nearly 5x faster than with the previous implementation. Of course, how fast the rendering engine is able to process a request and spit out a response depends on two key factors: the shop’s Liquid theme implementation, and the number of resources needed to process the request. To get a better idea of where the storefront rendering engine spends its time when processing a request, try using the Shopify Theme Inspector: this tool will help you identify potential bottlenecks so you can work on improving performance in those areas.

Figure: A simplified data schema of the application. The Storefront Renderer and a Redis instance are contained in a Kubernetes node; the Storefront Renderer also sends data to two sharded data stores outside of the node: sharded MySQL and sharded Redis.

Before we cover each topic, let’s briefly describe our application stack. As mentioned in the previous post, the new storefront rendering engine is a Ruby application. It talks to a sharded MySQL database and uses Redis to store and retrieve cached data.

Optimizing how we load all that data is extremely important, as one of our requirements was to improve rendering time for storefront requests. Here are some of the approaches that we took to accomplish that.

Using MySQL’s Multi-statement Feature to Reduce Round Trips

To reduce the number of network round trips to the database, we use MySQL’s multi-statement feature to allow sending multiple queries at once. With a single request to the database, we can load data from multiple tables at once. Here’s a simplified example:

This request is especially useful to batch-load a lot of data very early in the response lifecycle based on the incoming request. After identifying the type of request, we trigger a single multi-statement query to fetch the data we need for that particular request in one go, which we’ll discuss later in this blog post. For example, for a request for a product page, we’ll load data for the product, its variants, its images, and other product-related resources in addition to information about the shop and the storefront theme, all in a single round-trip to MySQL.
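As a rough sketch of the idea, here is how several SELECTs can be sent in one round trip and their result sets read back in order. The rendering engine itself is a Ruby application, so this JDBC version only illustrates the technique. The table and column names in the usage note are hypothetical, and the code assumes the driver accepts multiple statements per call (for MySQL Connector/J, that means enabling the `allowMultiQueries` connection property).

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

// Sketch of batching several SELECTs into a single database round trip.
final class MultiStatementSketch {
    // Joins individual statements into one batch string.
    static String batch(List<String> statements) {
        return String.join("; ", statements);
    }

    // Executes the batch and counts the rows of each result set, in statement order.
    // Assumes a SELECT-only batch and a driver configured to allow multiple statements.
    static List<Integer> rowCounts(Connection conn, String batchSql) throws SQLException {
        List<Integer> counts = new ArrayList<>();
        try (Statement stmt = conn.createStatement()) {
            boolean hasResultSet = stmt.execute(batchSql);
            while (hasResultSet) {
                try (ResultSet rs = stmt.getResultSet()) {
                    int rows = 0;
                    while (rs.next()) {
                        rows++;
                    }
                    counts.add(rows);
                }
                // Moves to the next statement's result set, if there is one.
                hasResultSet = stmt.getMoreResults();
            }
        }
        return counts;
    }
}
```

For a product page, the batch might combine something like `SELECT id, title FROM products WHERE id = 1` with `SELECT id, price FROM variants WHERE product_id = 1`, so both tables are read with one network round trip instead of two.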

Implementing a Thin Data Mapping Layer

As shown above, the new storefront rendering engine uses handcrafted, optimized SQL queries. This lets us write fine-tuned queries that select only the columns we need for each resource and leverage JOINs and sub-SELECT statements to optimize data loading based on the resources to load. Such queries are sometimes less straightforward to implement with a full-service object-relational mapping (ORM) layer.

However, the main benefit of this approach is the tiny memory footprint of using a raw MySQL client compared to using an object-relational mapping (ORM) layer that’s unnecessarily complex for our needs. Since there’s no unnecessary abstraction, forgoing the use of an ORM drastically simplifies the flow of data. Once the raw rows come back from MySQL, we effectively use the simplest ORM possible: we create plain old Ruby objects from the raw rows to model the business domain. We then use these Ruby objects for the remainder of the request. Below is an example of how it’s done.
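A minimal sketch of such a thin mapping layer might look like the following. The real engine uses plain old Ruby objects; this Java version, with its hypothetical `Product` model and column names, only shows the shape of the idea: one raw row in, one plain object out, nothing else in between.

```java
import java.util.Map;

// Hypothetical model object; the column names are illustrative.
final class Product {
    final long id;
    final String title;
    final String handle;

    private Product(long id, String title, String handle) {
        this.id = id;
        this.title = title;
        this.handle = handle;
    }

    // Maps one raw row (column name -> value) to a plain object, with no ORM involved.
    static Product fromRow(Map<String, Object> row) {
        return new Product(
            ((Number) row.get("id")).longValue(),
            (String) row.get("title"),
            (String) row.get("handle"));
    }
}
```

The rest of the request works with `Product` instances only, so the SQL and the row format stay contained in one place.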

Of course, not using an ORM layer comes with a cost: if implemented poorly, this approach can lead to more complexity leaking into the application code. Creating thin model abstractions using plain old Ruby objects prevents this from happening, and makes it easier to interact with resources while meeting our performance criteria. Of course, this approach isn’t particularly common and has the potential to cause panic in software engineers who aren’t heavily involved in performance work, instead worrying about schema migrations and compatibility issues. However, when speed is critical, we accept that complexity.

Book-keeping and Eager-loading Queries

An HTTP request for a Shopify storefront may end up requiring many different resources from data stores to render properly. For example, a request for a product page could lead to requiring information about other products, images, variants, inventory information, and a whole lot of other data not loaded by the initial multi-statement select. The first time the storefront rendering engine loads this page, it needs to query the database, sometimes making multiple requests, to retrieve all the information it needs. These queries can happen at any point during the request.

Figure: Flow of a request with the book-keeping solution, showing the Storefront Renderer's requests to the data stores and how it uses a Query Book Keeper middleware to eager-load data.

As it retrieves this data for the first time, the storefront rendering engine keeps track of the queries it performed on the database for that particular product page and stores that list of queries in a key-value store for later use. When an HTTP request for the same product page comes in later (which it knows when the cache key matches), the rendering engine looks up the list of queries it performed throughout the previous request of the same type and performs those queries all at once, at the very beginning of the current request, because we’re pretty confident we’ll need them for this request (since they were used in the previous request).

This book-keeping mechanism lets us eager-load data we’re pretty confident we’ll need. Of course, when a page changes, this may lead to over-fetching and/or under-fetching, which is expected, and the shape of the data we fetch stabilizes quickly over time as more requests come in.
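The book-keeping described above can be sketched as a small class that remembers, per cache key, which queries the previous request ran. Everything here is an assumption made for illustration: the class name, the in-memory map standing in for the key-value store, and the choice to replace the stored list after every request.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of query book-keeping; the map stands in for an external key-value store.
final class QueryBookKeeper {
    private final Map<String, Set<String>> store = new ConcurrentHashMap<>();

    // Queries to eager-load at the very beginning of a request for this cache key.
    List<String> queriesToPreload(String cacheKey) {
        return new ArrayList<>(store.getOrDefault(cacheKey, Collections.<String>emptySet()));
    }

    // Stores what the finished request actually ran, replacing the previous list.
    void record(String cacheKey, Set<String> queriesPerformed) {
        store.put(cacheKey, new LinkedHashSet<>(queriesPerformed));
    }
}
```

Replacing the list on every request is one simple way to get the stabilizing behaviour: if a page changes, the next request over-fetches or under-fetches once, and the request after that preloads the corrected list.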

On the other hand, some Liquid models of Shopify’s storefronts are not accessed as frequently, and we don’t need to eager-load data related to them. If we did, we’d increase I/O wait time for something that we probably wouldn’t use very often. What the new rendering engine does instead is lazy-load this data by default. Unless the book-keeping mechanism described above eager-loads it, we’ll defer retrieving data to only load it if it’s needed for a particular request.
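Lazy-loading by default can be sketched with a small wrapper that defers the expensive load until the value is first read. The `Lazy` class below is a hypothetical helper and not the engine’s actual implementation.

```java
import java.util.function.Supplier;

// Defers an expensive load until the value is first needed, then remembers the result.
// Not thread-safe; intended for use within a single request.
final class Lazy<T> {
    private final Supplier<T> loader;
    private T value;
    private boolean loaded;

    Lazy(Supplier<T> loader) {
        this.loader = loader;
    }

    // Runs the loader on first access only; later calls reuse the stored value.
    T get() {
        if (!loaded) {
            value = loader.get();
            loaded = true;
        }
        return value;
    }
}
```

If a template never touches the model, the loader never runs and the I/O wait is never paid.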

Implementing Caching Layers

Much like a CPU’s caching architecture, the new rendering engine implements multiple layers of caching to accelerate responses.

A critical aside before we jump into this section: adding caching should never be the first step towards building performance-oriented software. Start by building a solution that’s extremely fast from the get go, even without caching. Once this is achieved, then consider adding caching to reduce load on the various components on the system while accelerating frequent use cases. Caching is like a sharp knife and can introduce hard to detect bugs.

In-Memory Cache

A data scheme diagram showing that the Storefront Renderer and Redis instance are contained in a Kubernetes node. Within the Storefront Renderer is an In-memory cache. The Storefront Renderer sends Redis data. The Storefront Renderer sends data to two sharded data stores outside of the Kubernetes node: Sharded MySQL and Sharded Redis
A simplified data schema of the application with an in-memory cache for the Storefront Renderer

At the frontline of our caching system is an in-memory cache that you can essentially think of as a global hash that’s shared across requests within each web worker. Much like the majority of our caching mechanisms, this caching layer uses the LRU caching algorithm. As a result, we use this caching layer for data that’s accessed very often. This layer is especially useful in high throughput scenarios such as flash sales.
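
As an illustration of the LRU policy mentioned above, here is a minimal sketch of an in-process LRU cache. The class name `LruCache` and its interface are assumptions made for this example, not the rendering engine's actual code. It relies on the fact that a Ruby Hash preserves insertion order.

```ruby
# Hypothetical LRU cache: Ruby's Hash keeps insertion order, so the first
# key is always the least recently used one.
class LruCache
  def initialize(max_size)
    @max_size = max_size
    @data = {}
  end

  # Returns the cached value (marking it most recently used), or computes
  # and stores it via the block on a miss.
  def fetch(key)
    if @data.key?(key)
      value = @data.delete(key)
      @data[key] = value # re-insert to move the key to the "recent" end
      return value
    end

    value = yield
    @data[key] = value
    @data.delete(@data.keys.first) if @data.size > @max_size # evict the LRU entry
    value
  end

  def keys
    @data.keys
  end
end

cache = LruCache.new(2)
cache.fetch(:shop)    { "shop data" }
cache.fetch(:product) { "product data" }
cache.fetch(:shop)    { "never computed" } # hit: :shop becomes most recent
cache.fetch(:theme)   { "theme data" }     # miss: evicts :product
```

After these four calls the cache holds `:shop` and `:theme`. Frequently requested entries survive while rarely used ones are evicted, which is why this kind of layer suits hot data.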

Node-local Shared Caching

As a second layer on top of the in-memory cache, the new rendering engine leverages a node-local Redis store that’s shared across all server workers on the same node. Since the database is available on the same machine as the rendering engine process itself, this node-local data transfer prevents network overhead and improves response times. As a result, multiple Ruby processes benefit from sharing cached data with one another.

Full-page Caching

Once the rendering engine successfully renders a full storefront response for a particular type of request, we store the final output (most often an HTML or JSON string) in the local Redis so that subsequent requests matching the same cache key can retrieve it. This full-page caching solution lets us avoid regenerating storefront responses whenever we can reuse output we previously computed.

Database Query Results Caching

In a scenario where the full-page output cache, the in-memory cache, and the node-local cache don’t have a valid entry for a given request, we need to reach all the way to the database. Once we get a result back from MySQL, we transparently cache the results in Redis for later retrieval based on the queries and their parameters. As long as the cache keys don’t change, running the same database queries over and over always hits Redis instead of reaching all the way to the database.
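
A read-through cache keyed on a query and its parameters could look roughly like the sketch below. Everything here is an assumption made for illustration: the `QueryResultCache` class, the key format, and the in-memory Hash standing in for Redis.

```ruby
require "digest"
require "json"

# Hypothetical read-through cache keyed on the SQL text and its parameters.
class QueryResultCache
  attr_reader :database_hits

  def initialize(store = {})
    @store = store # stand-in for Redis (assumption)
    @database_hits = 0
  end

  # Same query text + same parameters => same key.
  def cache_key(sql, params)
    "query:" + Digest::SHA256.hexdigest(JSON.generate([sql, params]))
  end

  # Returns cached rows when present; otherwise runs the block (the real
  # database call) and caches whatever it returns.
  def fetch(sql, params)
    key = cache_key(sql, params)
    return @store[key] if @store.key?(key)

    @database_hits += 1
    @store[key] = yield
  end
end

cache = QueryResultCache.new
sql = "SELECT * FROM products WHERE handle = ?"

rows1 = cache.fetch(sql, ["cowboy-hat"]) { [{ "handle" => "cowboy-hat" }] }
rows2 = cache.fetch(sql, ["cowboy-hat"]) { raise "should not reach the database" }
rows3 = cache.fetch(sql, ["top-hat"])    { [{ "handle" => "top-hat" }] }
```

The second call is served from the cache, while the third uses different parameters, produces a different key, and goes to the database.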

Liquid Object Memoizer

Thanks to the Liquid templating language, merchants and partners may build custom storefront themes. When loading a particular storefront page, it’s possible that the Liquid template to render includes multiple references to the same object. This is common on the product page for example, where the template will include many references to the product object:
{{ product.title }}, {{ product.description }}, {{ product.featured_media }}, and others.

Of course, when each of these is executed, we don’t fetch the product over and over again from the database. We fetch it once, then keep it in memory for later use throughout the request lifecycle. This means that if the same product object is required multiple times at different locations during the render process, we always use the one and only instance of it throughout the entire request lifecycle.

The Liquid object memoizer is especially useful when multiple different Liquid objects end up loading the same resource. For example, when a collection page loads multiple product objects using {{ collection.products }} and then refers to a particular product using {{ all_products['cowboy-hat'] }}, the Liquid object memoizer loads it from an external data store once, then stores it in memory and fetches it from there if it’s needed later. On average, across all Shopify storefronts, we see that the Liquid object memoizer prevents between 16 and 20 accesses to Redis and/or MySQL for every single storefront request, where we leverage the in-memory cache instead. In some extreme cases, we see that the memoizer prevents up to 4,000 calls to data stores per request.
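
To make the idea concrete, here is a minimal sketch of a per-request memoizer. The `RequestMemoizer` class and the `load_product` lambda are hypothetical and only mimic the behavior described above. One memoizer instance is assumed to live for exactly one request.

```ruby
# Hypothetical per-request memoizer: each resource is loaded at most once.
class RequestMemoizer
  attr_reader :loads_avoided

  def initialize
    @objects = {}
    @loads_avoided = 0
  end

  # Returns the memoized object for (type, handle), loading it via the
  # block only the first time it is requested.
  def fetch(type, handle)
    key = [type, handle]
    if @objects.key?(key)
      @loads_avoided += 1
      @objects[key]
    else
      @objects[key] = yield
    end
  end
end

data_store_calls = 0
load_product = lambda do |handle|
  data_store_calls += 1 # stands in for a Redis/MySQL round trip
  { handle: handle }
end

memo = RequestMemoizer.new
# e.g. reached through collection.products ...
a = memo.fetch(:product, "cowboy-hat") { load_product.call("cowboy-hat") }
# ... and again through all_products['cowboy-hat']
b = memo.fetch(:product, "cowboy-hat") { load_product.call("cowboy-hat") }
```

Both lookups return the very same object, and the data store is only contacted once.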

Reducing Memory Allocations

Writing Memory-aware Code

Garbage collection execution is expensive, so we write code that doesn’t generate unnecessary objects. We favor methods and algorithms that modify objects in place instead of generating new objects. For example:

  • Use map! instead of map when dealing with lists. It prevents a new Array object from being created.
  • Use string interpolation instead of string concatenation. Interpolation doesn’t create unnecessary intermediate String objects.

This may not seem like much, but consider this: using #map! instead of #map could reduce your memory usage significantly, even when simply looping over an array of integers to double the values.

Let’s set up the following array of 1000 integers from 1 to 1000:

array = (1..1000).to_a

Then, let’s double each number in the array with Array#map:

array.map { |i| i * 2 }

The line above leads to one object allocated in memory, for a total of 8040 bytes.

Now let’s do the same thing with Array#map! instead:

array.map! { |i| i * 2 }

The line above leads to zero objects allocated in memory, for a total of 0 bytes.

Even with this tiny example, using map! instead of map saves ~8 kilobytes of allocated memory. Considering the sheer scale of the Shopify platform and the storefront traffic throughput it receives, every little bit of memory optimization counts: it helps the garbage collector run less often and for shorter periods of time, thus improving server response times.
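
One way to observe the difference yourself, using only Ruby's standard library, is to compare `GC.stat(:total_allocated_objects)` before and after each call. This is a rough sketch: the `allocated_objects` helper is a name invented here, it counts objects rather than bytes, and the exact counts can vary across Ruby versions.

```ruby
# Counts objects allocated while the block runs, using Ruby's GC statistics.
def allocated_objects
  before = GC.stat(:total_allocated_objects)
  yield
  GC.stat(:total_allocated_objects) - before
end

array = (1..1000).to_a

with_map      = allocated_objects { array.map  { |i| i * 2 } }
with_map_bang = allocated_objects { array.map! { |i| i * 2 } }

puts "map:  #{with_map} object(s) allocated"
puts "map!: #{with_map_bang} object(s) allocated"
```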

With that in mind, we use tracing and profiling tools extensively to dive deeper into areas in the rendering engine that are consuming too much memory and to make precise changes to reduce memory usage.

Method-specific Memory Benchmarking

To prevent accidentally increasing memory allocations, we built a test helper method that lets us benchmark a method or a block to know how many memory allocations and allocated bytes it triggers. Here’s how we use it:

This benchmark test will succeed if calling Product.find_by_handle('cowboy-hat') matches the following criteria:

  • The call allocates between 48 and 52 objects in memory;
  • The call allocates between 5100 and 5200 bytes in memory.

We allow a range of allocations because they’re not deterministic on every test run. This depends on the order in which tests run and the way data is cached, which can affect the final number of allocations.
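
For readers who want to experiment with the idea, a range-based allocation check can be approximated with the standard library. The sketch below is not our helper: `assert_allocated_objects`, `AllocationBudgetExceeded`, and the `find_by_handle` lambda are hypothetical stand-ins, and it only checks object counts (checking allocated bytes would need a memory profiler).

```ruby
# Hypothetical helper: fails when a block allocates a number of objects
# outside the expected range.
class AllocationBudgetExceeded < StandardError; end

def assert_allocated_objects(expected_range)
  before = GC.stat(:total_allocated_objects)
  yield
  allocated = GC.stat(:total_allocated_objects) - before
  return allocated if expected_range.cover?(allocated)

  raise AllocationBudgetExceeded,
        "expected #{expected_range} allocated objects, got #{allocated}"
end

# Stand-in for a lookup such as Product.find_by_handle (assumption).
find_by_handle = ->(handle) { { handle: handle, tags: %w[hats western] } }

# Passes as long as the lookup stays within a generous budget.
allocated = assert_allocated_objects(1..50) { find_by_handle.call("cowboy-hat") }

# A block that allocates when the budget is zero trips the check.
begin
  assert_allocated_objects(0..0) { Array.new(10) { |i| i.to_s } }
  budget_error = nil
rescue AllocationBudgetExceeded => e
  budget_error = e
end
```

Using a range rather than an exact number mirrors the reasoning above: small run-to-run variations should not fail the test, while a large regression should.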

As such, these memory benchmarks help us keep an eye on memory usage for specific methods. In practice, they’ve prevented introducing inefficient third-party gems that bloat memory usage, and they’ve increased awareness of memory usage among developers working on features.

We covered three main ways to improve server-side performance: batching up calls to external data stores to reduce roundtrips, caching data in multiple layers for specific use cases, and simplifying the amount of work required to fulfill a task by reducing memory allocations. When they’re all combined, these approaches lead to big time performance gains for merchants on the platform—the average response time with the new rendering engine is 5x faster than with the previous implementation. 

Those are just some of the techniques that we are using to make the new application faster. And we never stop exploring new ways to speed up merchants’ storefronts. Faster rendering times are in the DNA of our team!

- The Storefront Renderer Team

Celso Dantas is a Staff Developer on the Storefront Renderer team. He joined Shopify in 2013 and has worked on multiple projects since then. Lately, he has specialized in making merchants’ storefronts faster.

Maxime Vaillancourt is a Senior Developer on the Storefront Rendering team. He has been at Shopify for 3 years, starting on the Online Store Themes team, then specializing towards storefront performance with the storefront rendering engine rewrite.

Shipit! Presents: Performance Tips from the Storefront Renderer Team

The Storefront Renderer is a server-side application that loads a Shopify merchant's storefront Liquid theme, along with the data required to serve the request (for example product data, collection data, inventory information, and images), and returns the HTML response back to your browser. On average, server response times for the Storefront Renderer are four times faster than the implementation it replaced.

Our blog post, How Shopify Reduced Storefront Response Times with a Rewrite generated great discussions and questions. This event looks to answer those questions and dives deeper into the technical details of how we made the Storefront Renderer engine faster.

During this event you will learn how we:
  • optimized data access
  • implemented caching layers
  • reduced memory allocations

We're planning to DOUBLE our engineering team in 2021 by hiring 2,021 new technical roles (see what we did there?). Our platform handled record-breaking sales over BFCM and commerce isn't slowing down. Help us scale & make commerce better for everyone.

Resiliency Planning for High-Traffic Events

On January 27, 2021 Shipit!, our monthly event series, presented Building a Culture of Resiliency at Shopify. Learn about creating and maintaining resiliency plans for large development teams, testing and tooling, developing incident strategies, and incorporating and improving feedback loops. The video is now available.

Each year, Black Friday Cyber Monday weekend represents the peak of activity for Shopify. Not only is this the most traffic we see all year, but it’s also the time our merchants put the most trust in our team. Winning this weekend each year requires preparation, and it starts as soon as the weekend ends.

Load Testing & Stress Testing: How Does the System React?

When preparing for a high traffic event, load testing regularly is key. We have discussed some of the tools we use already, but I want to explain how we use these exercises to build towards a more resilient system.

While we use these tests to confirm that we can sustain required loads or probe for new system limits, we can also use regular testing to find potential regressions. By executing the same experiments on a regular basis, we can spot any trends at easily handled traffic levels that might spiral into an outage at higher peaks.

This same tool allows us to run similar loads against differently configured shops and look for differences caused by the theme, configuration, and any other dimensions we might want to use for comparison.

Resiliency Matrix: What are Our Failure Modes?

If you've read How Complex Systems Fail, you know that "Complex systems are heavily and successfully defended against failure" and "Catastrophe requires multiple failures - single point failures are not enough." For that to be true, we need to understand our dependencies, their failure modes, and how those impact the end-user experience.

We ask teams to construct a user-centric resiliency matrix, documenting the expected user experience under various scenarios. For example:

This user-centric resiliency matrix shows the potential failures and their impact on user experience. For example, can a user browse (yes) or check out (no) if MySQL is down.
User-centric resiliency matrix documenting expected user experience and possible failures

The act of writing this matrix serves as a very basic tabletop chaos exercise. It forces teams to consider how well they understand their dependencies and what the expected behaviors are.

This exercise also provides a visual representation of the interactions between dependencies and their failure modes. Looking across rows and columns reveals areas where the system is most fragile. This provides the starting point for planning work to be done. In the above example, this matrix should start to trigger discussion around the ‘User can check out’ experience and what can be done to make this more resilient to a single dependency going ‘down’.
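
If the matrix is kept as data, scanning it for fragile experiences can be automated. The following is a small sketch under loud assumptions: only the "MySQL down" row reflects the example described above, the "Redis down" row is made up for illustration, and `fragile_experiences` is a hypothetical helper.

```ruby
# Illustrative matrix: rows are failure scenarios, columns are user
# experiences. Only the MySQL row mirrors the article's example; the
# Redis row is invented for this sketch.
MATRIX = {
  "MySQL down" => { "User can browse" => true, "User can check out" => false },
  "Redis down" => { "User can browse" => true, "User can check out" => true }
}.freeze

# Returns, per experience, the single-dependency failures that break it.
def fragile_experiences(matrix)
  matrix.each_with_object(Hash.new { |h, k| h[k] = [] }) do |(failure, row), acc|
    row.each { |experience, works| acc[experience] << failure unless works }
  end
end

fragile = fragile_experiences(MATRIX)
fragile.each do |experience, failures|
  puts "#{experience} breaks when: #{failures.join(', ')}"
end
```

Here the only fragile experience is "User can check out", which is exactly the kind of cell that should prompt a discussion.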

Game Days: Do Our Models Match?

So, we’ve written our resilience matrix. This is a representation of our mental model of the system, and when written, it's probably a pretty accurate representation. However, systems change and adapt over time, and this model can begin to diverge from reality.

This divergence is often unnoticed until something goes wrong, and you’re stuck in the middle of a production incident asking “Why?”. Running a game day exercise allows us to test the documented model against reality and adjust in a controlled setting.

The plan for the game day will derive from the resilience matrix. For the matrix above, we might formulate a plan like:

This game day exercise allows us to test the model against reality and adjust in a controlled setting. This plan lays out scenarios to be tested and how they will be accomplished.
Game day planning scenarios 

Here, we are laying out what scenarios are to be tested, how those will be accomplished, and what we expect to happen. 

We’re not only concerned with external effects (what works, what doesn’t), but also with internal ones: do the expected alerts fire, are the appropriate on-call teams paged, and do those folks have the information they need to understand what is happening?

If we refer back to How Complex Systems Fail, the defences against failure are technical, human, and organizational. On a good game day, we’re attempting to exercise all of these.

  • Do any automated systems engage?
  • Do the human operators have the knowledge, information and tools necessary to intervene?
  • Do the processes and procedures developed help or hinder responding to the outage scenario?

By tracking the actual observed behavior, we can then update the matrix as needed or make changes to the system in order to bring our mental model and reality back into alignment.

Incident Analysis: How Do We Get Better?

During the course of the year, incidents happen which disrupt service in some capacity. While the primary focus is always on restoring service as fast as possible, each incident also serves as a learning opportunity.

This article is not about why or how to run a post-incident review; there are more than enough well-written pieces by folks who are experts on the subject. But to refer back to How Complex Systems Fail, one of the core tenets in how we learn from incidents is “Post-accident attribution to a ‘root cause’ is fundamentally wrong.”

When focusing on a single root cause, we stop at easy, shallow actions to resolve the ‘obvious’ problem. However, this ignores deeper technical, organizational, and cultural issues that contributed to the incident and will do so again if left uncorrected.

What’s Special About BFCM?

We’ve talked about the things we’re constantly doing, year-round to ensure we’re building for reliability and resiliency and creating an anti-fragile system that gets better after every disruption. So what do we do that’s special for the big weekend?

We’ve already mentioned How Complex Systems Fail several times, but to go back to that well once more, “Change introduces new forms of failure.” As we get closer to Black Friday, we slow down the rate of change.

This doesn’t mean we’re sitting on our hands and hoping for the best, but rather we start to shift where we’re investing our time. Fewer new services and features as we get closer, and more time spent dealing with issues of performance, reliability, and scale.

We review defined resilience matrices carefully, run more frequent game days and load tests, and work on any issues or bottlenecks those reveal. This means updating runbooks, refining internal tools, and shipping fixes for issues that this activity brings to light.

All of this comes together to provide a robust, reliable platform to power over $5.1 billion in sales.

Shipit! Presents: Building a Culture of Resiliency at Shopify

Watch Ryan talk about how we build a culture of resiliency at Shopify to ensure a robust, reliable platform powering over $5.1 billion in sales.

 

Ryan is a Senior Development Manager at Shopify. He currently leads the Resiliency team, a centralized globally distributed SRE team responsible for keeping commerce better for everyone.

How to Reliably Scale Your Data Platform for High Volumes

By Arbab Ahmed and Bruno Deszczynski

Black Friday and Cyber Monday—or as we like to call it, BFCM—is one of the largest sales events of the year. It’s also one of the most important moments for Shopify and our merchants. To put it into perspective, this year our merchants across more than 175 countries sold a record breaking $5.1+ billion over the sales weekend. 

That’s a lot of sales. That’s a lot of data, too.

This BFCM, the Shopify data platform saw an average throughput increase of 150 percent. Our mission as the Shopify Data Platform Engineering (DPE) team is to ensure that our merchants, partners, and internal teams have access to data quickly and reliably. It shouldn’t matter if a merchant made one sale per hour or a million; they need access to the most relevant and important information about their business, without interruption. While this is a must all year round, the stakes are raised during BFCM.

Creating a data platform that withstands the largest sales event of the year means our platform services need to be ready to handle the increase in load. In this post, we’ll outline the approach we took to reliably scale our data platform in preparation for this high-volume event. 

Data Platform Overview

Shopify’s data platform is an interdisciplinary mix of processes and systems that collect and transform data for use by our internal teams and merchants. It enables access to data through a familiar pipeline:

  • Ingesting data in any format, from any part of Shopify. “Raw” data (for example, pageviews, checkouts, and orders) is extracted from Shopify’s operational tables without any manipulation. Data is then conformed to an Apache Parquet format on disk.
  • Processing data, in either batches or streams, to form the foundations of business insights. Batches of data are “enriched” with models developed by data scientists, and processed within Apache Spark or dbt.
  • Delivering data to our merchants, partners, and internal teams so they can use it to make great decisions quickly. We rely on an internal collection of streaming and serving applications, and libraries that power the merchant-facing analytics in Shopify. They’re backed by BigTable, GCS, and CloudSQL.

In an average month, the Shopify data platform processes about 880 billion MySQL records and 1.75 trillion Kafka messages.

Tiered Services

As engineers, we want to conquer every challenge right now. But that’s not always realistic or strategic, especially when not all data services require the same level of investment. At Shopify, a tiered services taxonomy helps us prioritize our reliability and infrastructure budgets in a broadly declarative way. It’s based on the potential impact to our merchants and looks like this:

  • Tier 1: This service is critical externally, for example, to a merchant’s ability to run their business.
  • Tier 2: This service is critical internally to business functions, for example, an operational monitoring/alerting service.
  • Tier 3: This service is valuable internally, for example, internal documentation services.
  • Tier 4: This service is an experiment, in very early development, or is otherwise disposable, for example, an emoji generator.

The highest tiers are top priority. Our ingestion services, called Longboat and Speedboat, and our merchant-facing query service Reportify are examples of services in Tier 1.

The Challenge 

As we’ve mentioned, each BFCM the Shopify data platform receives an unprecedented volume of data and queries. Our data platform engineers did some forecasting work this year and predicted nearly two times the traffic of 2019. The challenge for DPE is ensuring our data platform is prepared to handle that volume. 

When it comes to BFCM, the primary risk to a system’s reliability is directly proportional to its throughput requirements. We call it throughput risk. It increases the closer you get to the front of the data pipeline, so the systems most impacted are our ingestion and processing systems.

With such a titillating forecast, the risk we faced was unprecedented throughput pressure on data services. In order to be BFCM ready, we had to prepare our platform for the tsunami of data coming our way.

The Game Plan

We tasked our Reliability Engineering team with Tier 1 and Tier 2 service preparations for our ingestion and processing systems. Here are the steps we took to prepare the systems most impacted by BFCM volume:

1. Identify Primary Objectives of Services

A data ingestion service's main operational priority can be different from that of a batch processing or streaming service. We determine upfront what the service is optimizing for. For example, if we’re extracting messages from a limited-retention Kafka topic, we know that the ingestion system needs to ensure, above all else, that no messages are lost in the ether because they weren’t consumed fast enough. A batch processing service doesn’t have to worry about that, but it may need to prioritize the delivery of one dataset versus another.

In Longboat’s case, as a batch data ingestion service, its primary objective is to ensure that a raw dataset is available within the interval defined by its data freshness service level objective (SLO). That means Longboat is operating reliably so long as every dataset being extracted is no older than eight hours, the default freshness SLO. For Reportify, our main query serving service, the primary objective is to get query results out as fast as possible; its reliability is measured against a latency SLO.
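
A freshness objective like this one boils down to a simple comparison. The sketch below assumes a hypothetical `stale_datasets` helper and made-up dataset timestamps; only the eight-hour default comes from the text above.

```ruby
FRESHNESS_SLO_SECONDS = 8 * 60 * 60 # default freshness SLO from the text: eight hours

# Returns the names of datasets whose last extraction is older than the SLO.
def stale_datasets(last_extracted_at, now:, slo_seconds: FRESHNESS_SLO_SECONDS)
  last_extracted_at.select { |_name, at| now - at > slo_seconds }.keys
end

now = Time.utc(2020, 11, 27, 12, 0, 0)
last_extracted_at = {
  "orders"    => now - (2 * 60 * 60), # extracted 2 hours ago (made-up value)
  "checkouts" => now - (9 * 60 * 60)  # extracted 9 hours ago (made-up value)
}

stale = stale_datasets(last_extracted_at, now: now)
puts "Datasets violating the freshness SLO: #{stale.inspect}"
```

With these values only `checkouts` is reported, since it is the one dataset older than eight hours.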

2. Pinpoint Service Knobs and Levers

With primary objectives confirmed, you need to identify what you can “turn up or down” to sustain those objectives.

In Longboat’s case, extraction jobs are orchestrated with a batch scheduler, and so the first obvious lever is job frequency. If you discover a raw production dataset is stale, it could mean that the extraction job simply needs to run more often. This is a service-specific lever.

Another service-specific lever is Longboat’s “overlap interval” configuration, which configures an extraction job to redundantly ingest some overlapping span of records in an effort to catch late-arriving data. It’s specified in a number of hours.

Memory and CPU are universal compute levers that we ensure we have control of. Longboat and Reportify run on Google Kubernetes Engine, so it’s possible to demand that jobs request more raw compute to get their expected amount of work done within their scheduled interval (ignoring total compute constraints for the sake of this discussion).

So, in pursuit of data freshness in Longboat, we can manipulate:

  1. Job frequency
  2. Longboat overlap interval
  3. Kubernetes Engine Memory/CPU requests

In pursuit of latency in Reportify, we can turn knobs like its:

  1. BigTable node pool size 
  2. ProxySQL connection pool/queue size

3. Run Load Tests!

Now that we have some known controls, we can use them to deliberately constrain the service’s resources. As an example, to simulate an unrelenting N-times throughput increase, we can turn the infrastructure knobs so that the service has 1/N of its usual compute headroom, which is equivalent to running at N times nominal load.
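
The arithmetic behind this trick is worth spelling out. In the sketch below, the CPU figures are hypothetical and the helper names are invented for the example.

```ruby
# Scales a resource budget down to 1/N to mimic N-times load.
def constrained_budget(nominal_budget, n)
  nominal_budget / n.to_f
end

# Fraction of the budget consumed by the current demand.
def utilization(demand, budget)
  demand / budget.to_f
end

nominal_cpus = 12.0 # hypothetical CPU budget
demand_cpus  = 3.0  # hypothetical steady-state usage
n            = 3    # simulate triple throughput

squeezed = constrained_budget(nominal_cpus, n)    # 4.0 CPUs
before   = utilization(demand_cpus, nominal_cpus) # 0.25
after    = utilization(demand_cpus, squeezed)     # 0.75

puts "Utilization goes from #{before} to #{after} without adding any traffic"
```

The workload is unchanged, but relative to its budget the service now looks as if it were handling three times the traffic.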

For Longboat’s simulation, we manipulated its “overlap interval” configuration and tripled it. Every table suddenly looked like it had roughly three times more data to ingest within an unchanged job frequency; throughput was tripled.

For Reportify, we leveraged our load testing tools to simulate some truly haunting throughput scenarios, issuing an increasingly extreme volume of queries, as seen here:

A line graph showing streaming service queries per second by source. The graph shows increase in the volume of queries over time during a load test.
Streaming service queries per second metric after the load test

In this graph, the doom is shaded purple. 

Load testing answers a few questions immediately, among others:

  • Do infrastructure constraints affect service uptime? 
  • Does the service’s underlying code gracefully handle memory/CPU constraints?
  • Are the raised service alarms expected?
  • Do you know what to do in the event of every fired alarm?

If any of the answers to these questions leave us unsatisfied, the reliability roadmap writes itself: we need to engineer our way into satisfactory answers to those questions. That leads us to the next step. 

4. Confirm Mitigation Strategies Are Up-to-Date

A service’s reliability depends on the speed at which it can recover from interruption. Whether that recovery is performed by a machine or human doesn’t matter when your CTO is staring at a service’s reliability metrics! After deliberately constraining resources, the operations channel turns into a (controlled) hellscape and it's time to act as if it were a real production incident.

Talking about mitigation strategy could be a blog post on its own, but here are the tenets we found most important:

  1. Every alert must be directly actionable. Just saying “the curtains are on fire!” without mentioning “put it out with the extinguisher!” amounts to noise.
  2. Assume that mitigation instructions will be read by someone broken out of a deep sleep. Simple instructions are carried out the fastest.
  3. If there is any ambiguity or unexpected behavior during controlled load tests, you’ve identified new reliability risks. Your service is less reliable than you expected. For Tier 1 services, that means everything else drops and those risks should be addressed immediately.
  4. Plan another controlled load test and ensure you’re confident in your recovery.
  5. Always over-communicate, even if acting alone. Other engineers will devote their brain power to your struggle.

5. Turn the Knobs Back

Now that we know what can happen with an overburdened infrastructure, we can make an informed decision whether the service carries real throughput risk. If we absolutely hammered the service and it skipped along smiling without risking its primary objective, we can leave it alone (or even scale down, which will have the CFO smiling too).

If we don’t feel confident in our ability to recover, we’ve unearthed new risks. The service’s development team can use this information to plan resiliency projects, and we can collectively scale our infrastructure to minimize throughput risk in the interim.

More generally, we perform capacity planning to make sure our infrastructure can handle the load we expect. You can learn more about Shopify’s BFCM capacity planning efforts on the blog.

Overall, we concluded from our results that:

  • Our mitigation strategy for Longboat and Reportify was healthy, needing gentle tweaks to our load-balancing maneuvers.
  • We should scale up our clusters to handle the increased load, not only from shoppers, but also from some of our own fun stuff like the BFCM Live Map.
  • We needed to tune our systems to make sure our merchants could track their online store’s performance in real-time through the Live View in the analytics section of their admin.
  • Some jobs could use some tuning, and some of their internal queries could use optimization.

Most importantly, we refreshed our understanding of data service reliability. Ideally, it’s not any more exciting than that. Boring reliability studies are best.

We hope to perform these exercises more regularly in the future, so BFCM preparation isn’t particularly interesting. In this post we talked about throughput risk as one example, but there are other risks to data integrity, correctness, and latency. We aim to get out in front of them too because data grows faster than engineering teams do. “Trillions of records every month” turns into “quadrillions” faster than you expect.

So, How’d It Go?

After months of rigorous preparation systematically improving our indices, schemas, query engines, infrastructure, dashboards, playbooks, SLOs, incident handling and alerts, we can proudly say BFCM 2020 went off without a hitch!

During the big moment we traced down every spike, kept our eyes glued to utilization graphs, and turned knobs from time to time, just to keep the margins fat. There were only a handful of minor incidents, and none of them impacted merchants, buyers, or internal teams. They were mainly self-healing cases, thanks to the nature of our platform and our spare capacity.

This success doesn’t happen by accident; it happens because of diligent planning, experience, curiosity and, most importantly, teamwork.

Arbab is a seven-year veteran at Shopify serving as Reliability Engineering lead. He previously helped launch Shopify Payments, some of the first Shopify public APIs, and Shopify's Retail offerings before joining the Data Platform. 99% of Shopifolk joined after him!
Bruno is a DPE TPM working with the Site Reliability Engineering team. He has a record of 100% successful BFCMs under his belt and plans to keep it that way.

Interested in helping us scale and tackle interesting problems? We’re planning to double our engineering team in 2021 by hiring 2,021 new technical roles. Learn more here!

The State of Ruby Static Typing at Shopify

Shopify changes a lot. We merge around 400 commits to the main branch daily and deploy a new version of our core monolith 40 times a day. The Monolith is also big: 37,000 Ruby files, 622,000 methods, more than 2,000,000 calls. At this scale with a dynamic language, even with the most rigorous review process and over 150,000 automated tests, it’s a challenge to ensure everything works properly. Developers benefit from a short feedback loop to ensure the stability of our monolith for our merchants.

Since 2018, our Ruby Infrastructure team has looked at ways to make the development process safer, faster, and more enjoyable for Ruby developers. While Ruby is different from other languages and brings amazing features allowing Shopify to be what it is today, we felt there was a feature from other languages missing: static typing.

Shipit! Presents: The State of Ruby Static Typing at Shopify

On November 25, 2020, Shipit!, our monthly event series, presented The State of Ruby Static Typing at Shopify. Alexandre Terrasa and I talked about the history of static typing at Shopify and our adoption of Sorbet.

We weren't able to answer all the questions during the event, so we've included answers to them below.

What are some challenges with adopting Sorbet? What was some code you could not type?

So far most of our problems are in modules that use ActiveSupport::Concern (many layers deep, even) and modules that assume they will be included in a certain kind of class, but have no way of making that explicit. For example, a module that assumes it will be included in an ActiveRecord model could be calling before_save to add a hook, but Sorbet would have no idea where before_save is defined. We are also looking to make those kinds of dependencies between modules and include sites explicit in Sorbet.

Inclusion requirements are also something we’re trying to fix right now, mostly for our helpers. The problem is explained in the description of this pull request: https://github.com/sorbet/sorbet/pull/3409.

If a method takes an array as an argument, do you have to specify what type of elements it contains? And if not, how does Sorbet make sure the methods you call on the array's elements exist?

It depends on the type of elements inside the array. For simple types like Integer or Foo, you can easily type it as T::Array[Integer] and Sorbet will be able to type check method calls. For more complex types, like arrays containing hashes, it depends: you may use T::Array[T.untyped], in which case Sorbet won’t be able to check the calls. With T.untyped you can make the type as deep and precise as you want: T::Array[T::Hash[String, T.untyped]], T::Array[T::Hash[String, T::Array[T.untyped]]], T::Array[T::Hash[String, T::Array[Integer]]], and Sorbet will check the calls on what it knows about. Note that as your type becomes more and more complex, you should consider creating a class for it so you can just use T::Array[MyNewClass].

How would you compare the benefits of Sorbet to Ruby relative to the benefits of TypeScript to JavaScript?

There are similar benefits, but Ruby is a much more dynamic language than JavaScript and Sorbet is a much younger project than TypeScript. So Sorbet’s coverage of Ruby features and the expressiveness of its type system lag behind what TypeScript brings to JavaScript.

On the other hand, Sorbet annotations are pure Ruby. That means you don’t have to learn a new language and you can keep using your existing editors and tooling to work with it. There is also no compilation of Ruby code with types to plain Ruby, like how you need to compile TypeScript to JavaScript. Finally, Sorbet also has a runtime type checker that can verify your types and alert you when they don’t hold while your application is running, which is a great additional safety net that TypeScript does not have.

Could you quickly relate Sorbet with RBS, and what is the future of Sorbet after Ruby 3.0?

Stripe gave an interesting answer to this question: https://sorbet.org/blog/2020/07/30/ruby-3-rbs-sorbet. RBS is about the language used to write the signatures; you still need a type checker to check those signatures against your code. We see Sorbet as one of the solutions that can use those types, and currently it’s the fastest one. One limitation of RBS is the lack of inline type annotations. For example, there is no syntax to cast a variable to another type, so type checkers have to use additional syntax to make this possible. Even if Sorbet doesn’t support RBS at the moment, it might in the future. And if that never happens, remember that it’s easier to go from one type specification to another than from an untyped codebase to a typed one. So the effort is not lost.

Does Tapioca support Enums etc?

Tapioca is able to generate RBI files for T::Enum definitions coming from gems. It can also generate method definitions for the ActiveRecord enums as DSL generators.

In which scenarios would you NOT use Sorbet?

I guess the one scenario where using Sorbet would be counterproductive is if there is no team buy-in for adopting Sorbet. If I were on such a team and I couldn’t convince the rest of the team of the utility of it, I would not push to use Sorbet.

Other than that, I can only think of a code base that has a LOT of metaprogramming idioms as a bad target for Sorbet. You would still get some benefits from running Sorbet even at typed: false, but it might not be worth the effort.

What editors do you personally use? Any standardization across the organization?

We have not standardized on a single editor across the company and we probably will not do so, since we believe in developers’ freedom to use the tools that make them the most productive. However, we cannot build tooling for every editor. So, today, most of our developer acceleration team builds tooling primarily for VSCode. Ufuk personally uses VSCode and Alex uses Vim.

Is there a roadmap for RBI -> RBS conversion for when Ruby 3.0 comes out?

No official roadmap yet. We’re still experimenting with this on the side of our main project: 100% typed: true files in our monolith. We can already say that some features from RBS will not directly translate to RBI and vice versa. You can see a comparison of both specifications here: https://github.com/Shopify/rbs_parser#whats-supported (might not be completely up-to-date with the latest version of RBS).

What are the major challenges you had or are having with Ruby GraphQL libraries?

Our team tried to marry the GraphQL typing system and the Sorbet types using RBI generation but we got stuck in some very dynamic usages of GraphQL resolvers, so we paused that work for now. On the other hand, there are teams within Shopify who have been using Sorbet and GraphQL together by changing the way they write GraphQL endpoints. You can read more about the technical details of that from the blog post of one of the Shopify engineers that has worked on that: https://gmalette.dev/posts/graphql-and-sorbet-and-unit-tests/.

What would be the first step in getting started with typing in a Rails project? What kinds of files should be checked in to a repo?

The fastest way to start is to use the steps listed on the Sorbet site to start running with Sorbet. After doing that, you can take a look at using sorbet-rails to generate Rails RBI files for you, or you can look at tapioca to generate gem RBIs. Since you can go gradual it’s totally up to you.

Our advice would be to first target all files at typed: false. If you use tapioca, the price is really low and it already brings a lot of benefit. Then try to move the files to typed: true where doing so does not create new type errors (you can use spoom for that: https://github.com/Shopify/spoom#change-the-sigil-used-in-files).

When it comes to adding signatures, prefer the files that are the most reused or, if you track errors from production, start with the files that create the most errors. Files touched by many teams are also an interesting target, as signatures make collaboration easier. Other good candidates are files with a lot of churn and files defining methods that are reused a lot across your codebase.

As for the files that you need to check in to a repository, the best practice is to check in all the files (mostly RBI files) generated by Sorbet and/or tapioca/sorbet-typed. Those files enable the code to be type checked, so they should be available to all the developers who work on the code base.


We're planning to DOUBLE our engineering team in 2021 by hiring 2,021 new technical roles (see what we did there?). Our platform handled record-breaking sales over BFCM and commerce isn't slowing down. Help us scale & make commerce better for everyone.

Organizing 2000 Developers for BFCM in a Remote World

Shopify is an all-in-one commerce platform that serves more than 1M merchants in approximately 175 countries across the world. Many of our merchants prepare months in advance for their biggest shopping season of the year, and they trust us to help them get through it successfully. As our merchants grow and their numbers increase, we must scale our platform without compromising on our stability, performance, and quality.

With Black Friday and Cyber Monday (BFCM) being the two biggest shopping events of the year and with other events on the horizon, there is a lot of preparation that Shopify needs to do on our platform. This effort needs a key driver to set expectations for many teams and hold them accountable to complete the work for their area in the platform. 

Lisa Vanderschuit getting hyped about making commerce better for everyone

I’m an Engineering Program Manager (EPM) with a focus on platform quality and was one of the main program managers (PgM) tapped midway in the year to drive these efforts. For this initiative I worked with three Production Engineering leads (BFCM leads) and three other program managers (with a respective focus in resiliency, scale, and capacity) to:

  • understand opportunities for improvement
  • build out a program that’s effective at scale
  • create adjustments to the workflow specifically for BFCM
  • execute the program
  • start iterating on the program for next year.

Understanding our Opportunities

Each year, the BFCM leads start a large cross company push to get the platform ready for BFCM. They ask the teams responsible for critical areas of the platform to complete the following prep:

Looking at past years, the BFCM leads chosen to champion this spent a significant amount of time on administrative, communication, and reporting activities when their time would have been better spent in the weeds of the problems. Our PgM group was assigned to take on these responsibilities so that these leads could focus on investigating the technical challenges and escalations.

Before jumping into solutions, our PgM group looked into the past to find lessons to inform the future. In looking at past retrospective documents we found some common themes over the years that we needed to keep in mind as we put together our plan:

  • Shopify needs to prepare in advance for supporting large merchants with lots of popular inventory to sell. 
  • Scaling trends weren’t just on the two main days. Sales were spreading out through the week, and there were pre-sale and post-sale workflows where we needed to test thoroughly how much load we could sustain without performance issues.
  • There were some parts of the platform tied to disruptions in the past that would require additional load testing to give us more confidence in their stability. 

With Shopify moving to Digital by Default and the increasing number of timezones to consider, there were more complexities to getting the company aligned to the same goals and schedule. Our PgM group wanted to create structure around coordinating a large scale effort, but we also wanted to start thinking about how maintenance and prep work can be done throughout the year, so we’re ready for any large shopping event regardless of the time of year. 

Building the Program Plan

Our PgM group listed all the platform preparation tasks the BFCM leads asked developers to do in the past. Then we highlighted items that had to happen this year and took note of when they needed to happen. After this, we asked the BFCM leads to highlight the important things critical for their participation and then we assigned the rest of the work for our PgM group to manage. 

Example of our communication plan calendar

Once we had those details documented, we created a communication plan calendar (a.k.a. a spreadsheet) to see what was next, week over week. We split the PgM work into workstreams, then we each selected the ones that matched our areas of focus. In my workstream I had two main responsibilities:

  • Put together a plan to get people assigned to do the platform preparation work that the BFCM leads wanted them to. 
  • Determine what kind of PRs should or should not be shipped to production in the month before and after BFCM.

For the platform preparation work listed earlier, I asked teams to identify which areas of the platform needed prepping for the large shopping event. Even with a reduced set of areas to focus on, there were still quite a few people that I needed to get this prep work assigned to. Instead of working directly with every single person, I used a distributed ownership model. I asked each GM or VP with critical areas to assign a champion from their department for me to work with. Then I reached out to the champions to let them know about the prep work that needed to be done. They then either took on the work themselves or assigned it to people on their team. To keep track of this ownership, I built a tracking spreadsheet and set up a schedule to report on progress week over week.

In the past, our Deploys team would lock the ability to automatically deploy to production for a week. Since BFCM is becoming more spread out year after year, we realized we needed to adjust our culture around shipping in the last two months of the year to make sure we could confidently provide merchants with a resilient platform. Merchants also needed to train their staff further in advance, so we had to consider slowing down new features that could require extra staff training. To start tackling these challenges I asked:

  • Teams to take inventory of all of our platform areas and highlight which areas were considered critical for the merchant experience. 
  • That we set up a rule in a bot we call Caution Tape to comment with a thorough risk-to-value assessment on any new PRs created between November and December in repos that had been flagged as critical to successful large shopping events.

If a PR proposed a merchant-facing feature, the Caution Tape bot message asked the authors to document the risks versus the value of shipping around BFCM and to ship only if approved by a director or GM in their area. In many cases the people creating these PRs either investigated a safer approach, got more thorough reviews, or decided to wait until next year to launch the feature.
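A rule like this can be expressed in a few lines. The following is only a sketch of the idea, not the actual Caution Tape implementation: the `PullRequest` shape, the repo names, and the comment wording are all hypothetical.

```typescript
// Hypothetical shape of a pull request event; the field names are assumptions.
interface PullRequest {
  repo: string;
  createdAt: Date;
  merchantFacing: boolean;
}

// Hypothetical list of repos flagged as critical for large shopping events.
const CRITICAL_REPOS: string[] = ["storefront", "checkout"];

// True when the PR was opened in November or December (UTC) in a flagged repo.
function needsRiskAssessment(pr: PullRequest): boolean {
  const month = pr.createdAt.getUTCMonth(); // 0-based: 10 = November, 11 = December
  const inWindow = month === 10 || month === 11;
  return inWindow && CRITICAL_REPOS.indexOf(pr.repo) !== -1;
}

// Builds the bot comment, or returns null when the rule does not apply.
// The message text is illustrative only.
function cautionTapeComment(pr: PullRequest): string | null {
  if (!needsRiskAssessment(pr)) return null;
  const base = "Please add a risk-to-value assessment for shipping this change around BFCM.";
  return pr.merchantFacing
    ? base + " Merchant-facing features should ship only with approval from a director or GM."
    : base;
}
```

Keeping the date window and the repo list as plain data makes it easy to relax or tighten the rule from one year to the next.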

On the week of Black Friday 60% of the work in GitHub was code reviews

To artificially slow down the rate of items being shipped to production, we planned to reduce the number of PRs that could be shipped in a deploy and increase the amount of time they would spend in canaries (pre-production deploy tests). On top of this, we also planned to lock deploys for a week around the BFCM weekend.

Executing the Program

How do you rally 2500+ people who are responsible for 1000+ deploys across all services around a mission? You state your expectations and then repeat them many times in different ways.

1. Our PgM and BFCM lead group had two main communication channels where we started engaging with the rest of the engineering group working on platform prep:

  • Shared Slack channels for discussions, questions, and updates.
  • GitHub repos for assigning work.

2. Our PgM group put together and shared internal documentation on the program details to make it easier to onboard participants to the program.

3. Our PgM group shared high-level announcements, reminders, and presentations throughout the year, increasing in frequency leading up to BFCM, to increase awareness and engagement. Some examples of this were:

  • progress reports on each department posted in Slack. 
  • live-streamed and recorded presentations on our internal broadcasting channel to inform the company about our mission and where to go for help.
  • emails sent to targeted groups to remind people of their respective responsibilities and deadlines.
  • GitHub issues created and assigned to teams with a checklist of the prep work we had asked them to do.

To make sure our BFCM leads had the support they needed our PgM group had regular check-in meetings with them to get a pulse on how they were feeling things were going. To make sure our PgM group was on top of the allocated tasks each week we had meetings at the start and end of each week. Then we hosted office hours along with the BFCM leads for any developers that wanted facetime to flag any potential concerns about their area.

Celebrations and Lessons Learned

Overall I’d say our program was a success. We had a very successful BFCM with sales of $5.1+ billion from the more than one million Shopify-powered brands around the world. We found that our predictions for which areas would take the most load were on target and that the load testing and spinning up of resources paid off. 

A photo of 3 women and 2 men celebrating. Gold confetti showers down on them.
Celebrating BFCM

From our internal developer view, we had success in the sense that shipping to the platform didn’t need to come to a full stop. PR reviews were at an all-time high, which meant that developers’ focus on quality was at an all-time high. For the areas where we did have to slow down on shipping feature code, we found that our developers had more time to work on the other important aspects of engineering work. Teams were able to:

  • focus more on clean up tasks
  • write blog posts
  • put together strategic roadmaps and architecture design docs
  • plan team building exercises. 

Overall, we still took a hit in developer productivity, and we could have been a bit more relaxed about how long we enforced the extra risk-to-value assessment on our PRs and our expectations for deploying to production. Our PgM team hopes to find a more balanced approach for next year's plan.

From a communication standpoint, some of the messaging to developers was inconsistent on whether or not they could ship to critical areas of the platform during November and December. Our PgM group also ended up putting together some of the announcement drafts at the last minute, so in future years we want to have this included in our communication plan from the start, with templates ready to go.

Our PgM group is hoping to have a retrospective meeting later this year with the BFCM leads to see how we can adjust the program plan for next year. We will take everything we learned and find opportunities to automate some of the work or distribute it throughout the year so we’re always ready for any large shopping event.

If you have a large initiative at your company, consider creating a role for people technical enough to be dangerous that can help drive engineering initiatives forward and work with your top developers to maximise their time and expertise to solve the big complex problems and get shit done.

Lisa Vanderschuit is an Engineering Program Manager who manages the engineering theme of Code Quality. She has been at Shopify for 6 years, working on areas from editing and reviewing Online Store themes to helping our engineering teams raise the bar of code quality at Shopify.

How does your team leverage program managers at your company? What advice do you have for coordinating cross company engineering initiatives? We want to hear from you on Twitter at @ShopifyEng.


A World Rendered Beautifully: The Making of the BFCM 3D Data Visualization

By Mikko Haapoja and Stephan Leroux

2020 Black Friday Cyber Monday (BFCM) is over, and another BFCM Globe has shipped. We’re extremely proud of the globe, which focused on realism, performance, and the impact our merchants have on the world.

The Black Friday Cyber Monday Live Map

We knew we had a tall task in front of us this year, building something that could represent orders from our one million merchants in just two months. Not only that, we wanted to ship a data visualization for our merchants so they could have a similar experience to the BFCM globe every day in their Live View.

Prototypes for the 2020 BFCM Globe and Live View. **

With tight timelines and an ambitious initiative, we immediately jumped into prototypes with three.js and planned our architecture.

Working with a Layer Architecture

As we planned this project, we converged architecturally on the idea of layers. Each layer is similar to a React component where state is minimally shared with the rest of the application, and each layer encapsulates its own functionality. This allowed for code reuse and gave us the flexibility to build the Live View Globe, the BFCM Globe, and beyond.

A showcase of layers for the 2020 BFCM Globe. **

When realism is key, it’s always best to lean on fantastic artists, and that’s where Byron Delgado came in. We hoped that Byron would be able to use the 3D modeling tools he’s used to, and then we would incorporate his 3D models into our experience. This is where the EarthRealistic layer comes in.

EarthRealistic layer from the 2020 BFCM Globe. **

EarthRealistic uses a technique called physically based rendering, which most modern 3D modeling software supports. In three.js, physically based rendering is implemented via the MeshPhysicalMaterial or MeshStandardMaterial materials.

To achieve realistic lighting, EarthRealistic is lit by a 32-bit EXR Environment Map. Using a 32-bit EXR means we get smooth image-based lighting. Image-based lighting is a technique where a “360 sphere” is created around the 3D scene, and pixels in that image are used to calculate how bright triangles on 3D models should be. This allows for complex lighting setups without much effort from an artist. Traditionally, images on the web such as JPGs and PNGs have a color depth of 8 bits. If we had used these formats and their 8-bit color depth, our globe lighting would have had horrible gradient banding, missing realism entirely.

Rendering and Lighting the Carbon Offset Visualization

Once we converged on physically based rendering and image based lighting, building the carbon offset layer became clearer. Literally!

Carbon Offset visualization layer from the 2020 BFCM Globe. **

Bubbles have an interesting phenomenon where they can be almost opaque at a certain angle and light intensity but in other areas completely transparent. To achieve this look, we created a custom material based on MeshStandardMaterial that reads in an Environment Map and simulates the bubble lighting phenomenon. The following is the easiest way to achieve this with three.js:

  1. Create a custom Material class that extends off of MeshStandardMaterial.
  2. Write a custom Vertex or Fragment Shader and define any Uniforms for that Shader Program.
  3. Override onBeforeCompile(shader: Shader, _renderer: WebGLRenderer): void on your custom Material and pass the custom Vertex or Fragment Shader and uniforms via the Shader instance.

Here’s our implementation of the above for the Carbon Offset Shield Material:

Let’s look at the above, starting with our Fragment shader. In shield.frag lines 94-97

These two lines are all that are needed to achieve a bubble effect in a fragment shader.

To calculate the brightness of an RGB pixel, you calculate the length or magnitude of the pixel using the GLSL length function. In three.js shaders, outgoingLight is an RGB vec3 representing the outgoing light or pixel to be rendered.

If you remember from earlier, the bubble’s brightness determines how transparent or opaque it should appear.  After calculating brightness, we can set the outgoing pixel’s alpha based on the brightness calculation. Here we use the GLSL mix function to go between the expected alpha of the pixel defined by diffuseColor.a and a new custom uniform defined as maxOpacity. By having the concept of min or expected opacity and max opacity, Byron and other artists can tweak visuals to their exact liking.
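To make the arithmetic concrete, here is a CPU-side sketch of the same two steps in TypeScript. It is not the shader from shield.frag: `length` and `mix` are re-implemented to mirror the GLSL built-ins, `bubbleAlpha` is a hypothetical helper name, and clamping brightness to the 0..1 range before using it as the mix factor is an assumption.

```typescript
type Vec3 = [number, number, number];

// Mirrors GLSL length(): the Euclidean magnitude of a vector.
function length(v: Vec3): number {
  return Math.sqrt(v[0] * v[0] + v[1] * v[1] + v[2] * v[2]);
}

// Mirrors GLSL mix(): linear interpolation from a to b by factor t.
function mix(a: number, b: number, t: number): number {
  return a * (1 - t) + b * t;
}

// Hypothetical helper: brighter pixels move from the expected alpha
// (diffuseColor.a in the shader) toward maxOpacity.
// Assumption: brightness is clamped to [0, 1] before it is used as the mix factor.
function bubbleAlpha(outgoingLight: Vec3, diffuseAlpha: number, maxOpacity: number): number {
  const brightness = Math.min(1, Math.max(0, length(outgoingLight)));
  return mix(diffuseAlpha, maxOpacity, brightness);
}
```

A black pixel keeps the expected alpha, while a fully lit pixel reaches `maxOpacity`, which matches the "transparent in the dark, opaque in the highlights" look described above.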

If you look at our shield.frag file, it may seem daunting! What on earth is all of this code?  three.js materials handle a lot of functionality, so it’s best to make small additions and not modify existing code. three.js materials all have their own shaders defined in the ShaderLib folder. To extend a three.js material, you can grab the original material shader code from the src/renderers/shaders/ShaderLib/ folder in the three.js repo and perform any custom calculations before setting gl_FragColor. An easier option to access three.js shader code is to simply console.log the shader.fragmentShader or shader.vertexShader strings, which are exposed in the onBeforeCompile function:

onBeforeCompile runs immediately before the Shader Program is created on the GPU. Here you can override shaders and uniforms. CustomMeshStandardMaterial.ts is an abstraction we wrote to make creating custom materials easier. It overrides the onBeforeCompile function and manages uniforms while your application runs via the setCustomUniform and getCustomUniform functions. You can see this in action in our custom Shield Material when getting and setting maxOpacity:

Using Particles to Display Orders

Displaying orders on Shopify from across the world using particles. **

One of the BFCM globe’s main features is the ability to view orders happening in real-time from our merchants and their buyers worldwide. Given Shopify’s scale and amount of orders happening during BFCM, it’s challenging to visually represent all of the orders happening at any given time. We wanted to find a way to showcase the sheer volume of orders our merchants receive over this time in both a visually compelling and performant way. 

In the past, we used visual “arcs” to display the connection between a buyer’s and a merchant’s location.

The BFCM Globe from 2018 showing orders using visual arcs.

With thousands of orders happening every minute, using arcs alone to represent every order quickly became a visual mess along with a heavy decrease in framerate. One solution was to cap the number of arcs we display, but this would only allow us to display a small fraction of the orders we were processing. Instead, we investigated using a particle-based solution to help fill the gap.

With particles, we wanted to see if we could:

  • Handle thousands of orders at any given time on screen.
  • Maintain 60 frames per second on low-end devices.
  • Have the ability to customize style and animations per order, such as visualizing local and international orders.

From the start, we figured that rendering geometry per order wouldn't scale well if we wanted to have thousands of orders on screen. Particles appear on the globe as highlights, so they don’t necessarily need to have a 3D perspective. Rather than using triangles for each particle, we began our investigation with three.js Points, which allowed us to draw using dots instead. Next, we needed an efficient way to store data for each particle we wanted to render. Using BufferGeometry, we assigned custom attributes that contained all the information we needed for each particle/order.

To render the points and make use of our attributes, we created a ShaderMaterial, and custom vertex and fragment shaders. Most of the magic for rendering and animating the particles happens inside the vertex shader. Each particle defined in the attributes we pass to our BufferGeometry goes through a series of steps and transformations.

First, each particle has a starting and ending location described using latitude and longitude. Since we want the particle to travel along the surface and not through it, we use a geo interpolation function on our coordinates to find a path that goes along the surface.

An order represented as a particle traveling from New York City to London. The vertex shader uses each location’s latitude and longitude and determines the path it needs to travel. **
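One common way to interpolate along the surface is spherical linear interpolation (slerp) between the two locations. The sketch below shows that technique in TypeScript on the CPU. It is an illustration rather than the shader code used for the globe, and the function names and axis convention (z pointing to the north pole) are assumptions.

```typescript
interface LatLon {
  lat: number; // degrees
  lon: number; // degrees
}

type Vec3 = [number, number, number];

const DEG_TO_RAD = Math.PI / 180;

// Converts latitude/longitude in degrees to a point on the unit sphere.
function toUnitVector(p: LatLon): Vec3 {
  const lat = p.lat * DEG_TO_RAD;
  const lon = p.lon * DEG_TO_RAD;
  return [Math.cos(lat) * Math.cos(lon), Math.cos(lat) * Math.sin(lon), Math.sin(lat)];
}

// Returns the point a fraction t (0..1) of the way along the great-circle arc
// from a to b. The result stays on the unit sphere instead of cutting through it.
// Limitation: exactly antipodal points have no unique shortest arc and are not handled.
function geoInterpolate(a: LatLon, b: LatLon, t: number): Vec3 {
  const va = toUnitVector(a);
  const vb = toUnitVector(b);
  const dot = Math.min(1, Math.max(-1, va[0] * vb[0] + va[1] * vb[1] + va[2] * vb[2]));
  const omega = Math.acos(dot); // angle between the two locations
  if (omega < 1e-9) return va; // start and end coincide
  const sinOmega = Math.sin(omega);
  const wa = Math.sin((1 - t) * omega) / sinOmega;
  const wb = Math.sin(t * omega) / sinOmega;
  return [wa * va[0] + wb * vb[0], wa * va[1] + wb * vb[1], wa * va[2] + wb * vb[2]];
}
```

A plain linear interpolation between the two points would pass through the inside of the globe, which is why the weights are built from sines of the arc angle.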

Next, to give the particle height along its path, we use some high school geometry: a parabola equation based on time that bends the straight path into a curve.

Particles follow a curved path away from the earth’s surface using a parabola equation to determine its height. **
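As a worked example, a parabola that is zero at both ends of the trip and peaks halfway can be written as h(t) = 4 · maxHeight · t · (1 − t). The sketch below uses that form. The exact equation and the names `arcHeight`, `liftFromSurface`, and `maxHeight` are assumptions for illustration, not the values used in the globe.

```typescript
type Vec3 = [number, number, number];

// Height above the surface at normalized time t in [0, 1].
// Zero at departure (t = 0) and arrival (t = 1), maxHeight at the midpoint.
function arcHeight(t: number, maxHeight: number): number {
  return 4 * maxHeight * t * (1 - t);
}

// Pushes a unit-length surface direction out to the globe radius plus the
// current height, giving the particle's 3D position.
function liftFromSurface(direction: Vec3, globeRadius: number, height: number): Vec3 {
  const r = globeRadius + height;
  return [direction[0] * r, direction[1] * r, direction[2] * r];
}
```

Because the height depends only on t, the same time value that drives the particle along its path also drives how far it rises from the surface.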

To make the particle look 3D in its travels, we combine our height and projected path data, then convert it to a vector position that our shader uses as its gl_Position. With our particle now knowing where it needs to go, we use a time uniform to drive animations for other changes such as size and color. Once the vertex shader is complete, it passes the position and point size on to the fragment shader, which combines the animated color and alpha for each particle.

Given that we wanted to support updating and animating thousands of particles at any moment, we wanted to be careful about how we access and update our attributes. For example, if we had 10,000 particles in transit, we would need to keep updating those along with the other data points that are coming in. Instead of updating all of our attributes every time, which can be processor-intensive, we made use of BufferAttribute’s updateRange to update only the subset of the attributes we needed to change on each frame.
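The bookkeeping behind a partial update can be sketched without three.js at all. The helper below computes the smallest contiguous range covering the particles that changed. It assumes the range is expressed as an `offset` and `count` measured in array elements (particle index times the attribute's item size), which is how `updateRange` was documented in three.js releases of that period. `dirtyRange` is a hypothetical name.

```typescript
interface UpdateRange {
  offset: number; // first array element to upload
  count: number; // number of array elements to upload
}

// Returns the smallest contiguous element range covering every changed
// particle, or null when nothing changed this frame.
// itemSize is the number of array elements per particle (3 for a vec3 attribute).
function dirtyRange(changedParticles: number[], itemSize: number): UpdateRange | null {
  if (changedParticles.length === 0) return null;
  let min = changedParticles[0];
  let max = changedParticles[0];
  for (const index of changedParticles) {
    if (index < min) min = index;
    if (index > max) max = index;
  }
  return { offset: min * itemSize, count: (max - min + 1) * itemSize };
}

// Possible usage with a three.js BufferAttribute (not executed here):
//   const range = dirtyRange(changed, attribute.itemSize);
//   if (range) {
//     attribute.updateRange.offset = range.offset;
//     attribute.updateRange.count = range.count;
//     attribute.needsUpdate = true;
//   }
```

A single contiguous range is a trade-off: it may upload some untouched particles that sit between changed ones, but it keeps the per-frame bookkeeping trivial.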

Rendering with GL_POINTS enables us to render 150,000 particles flying around the globe at any given time without performance issues. **

Combining all of the above, we saw upwards of 150,000 particles animating to and from locations on the globe without noticing any performance degradation.

Optimizing Performance

In video games, you may have seen settings for different quality levels. These settings modify the render quality of the application. Most modern games will automatically scale performance. Most aggressively, the application may reduce texture quality or how many vertices are rendered per 3D object.

With the amount of development time we had for this project, we simply didn’t have time to be this aggressive. Yet, we still had to support old, low-power devices such as dated mobile phones. Here’s how we implemented an auto optimizer that could increase an iPhone 7+’s render performance from 40 frames per second (fps) to a cool 60fps.

If your application isn’t performing well, you might see a graph like this:

Graph depicting the Globe application running at 40 frames per second on a low power device

Ideally, a modern application should be running at 60fps or more. You can also use this metric to determine when you should lower the quality of your application. Our initial implementation plan was to keep it simple and make every device with a low-resolution display run in low quality. However, this would mean new phones with low-resolution displays and extremely capable GPUs would receive a low-quality experience. Our final attempt monitors fps. If it’s lower than 55fps for over 2 seconds, we decrease the application’s quality. This adjustment allows phones such as the new iPhone 12 Pro Max to run in the highest quality possible while an iPhone 7+ can render at a lower quality but a consistently high framerate. Decreasing the quality of an application by reducing buffer sizes is optimal. However, on our aggressive timeline, this would have created many bugs and overall application instability.

Left side of the image depicts the application running in High-Quality mode, where the right side of the image depicts the application running in Low-Quality mode. **

What we opted for instead was simple and likely more effective. When our application retains a low frame rate, we simply reduce the size of the <canvas> HTML element, which means we’re rendering fewer pixels. After this, WebGL has to do far less work, in most cases, 2x or 3x less work. When our WebGLRenderer is created, we setPixelRatio based on window.devicePixelRatio. When we’ve retained a low frame rate, we simply drop the canvas pixel ratio back down to 1x. The visual differences are nominal and mainly noticeable in edge aliasing. This technique is simple but effective. We also reduce the resolution of our Environment Maps generated by PMREMGenerator, but most applications will be able to utilize the devicePixelRatio drop more effectively.
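The monitoring logic described above (below 55fps for more than 2 seconds, then drop quality once) fits in a small class. This is a minimal sketch under those stated thresholds, not our production optimizer. `AutoOptimizer`, `sample`, and the callback are hypothetical names, and timestamps are passed in so the logic can be exercised without a browser.

```typescript
class AutoOptimizer {
  private lowSince: number | null = null; // when the current low-fps streak began
  private degraded = false; // quality is only lowered once

  constructor(
    private readonly onDegrade: () => void,
    private readonly minFps: number = 55,
    private readonly graceMs: number = 2000
  ) {}

  // Call once per frame with the current fps estimate and a timestamp in milliseconds.
  sample(fps: number, nowMs: number): void {
    if (this.degraded) return;
    if (fps >= this.minFps) {
      this.lowSince = null; // frame rate recovered, reset the streak
      return;
    }
    if (this.lowSince === null) {
      this.lowSince = nowMs; // first slow frame of a new streak
      return;
    }
    if (nowMs - this.lowSince > this.graceMs) {
      this.degraded = true;
      this.onDegrade();
    }
  }
}

// Possible usage with a three.js WebGLRenderer (not executed here):
//   const optimizer = new AutoOptimizer(() => renderer.setPixelRatio(1));
//   // inside the render loop: optimizer.sample(currentFps, performance.now());
```

Resetting the streak whenever a frame meets the threshold means a brief hiccup, such as a garbage collection pause, does not trigger the quality drop.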

If you’re curious, this is what our graph looks like after the Auto Optimizer kicks in (red circled area)

Graph depicting the Globe application running at 60 frames per second on a low power device with a circle indicating when the application quality was reduced

Globe 2021

We hope you enjoyed this behind the scenes look at the 2020 BFCM Globe and learned some tips and tricks along the way. We believe that by shipping two globes in a short amount of time, we were able to focus on the things that mattered most while still keeping a high degree of quality. However, the best part of all of this is that our globe implementation now lives on as a library internally that we can use to ship future globes. Onward to 2021!

*All data is unaudited and is subject to adjustment.
**Made with Natural Earth; textures from Visible Earth NASA

Mikko Haapoja is a development manager from Toronto. At Shopify he focuses on 3D, Augmented Reality, and Virtual Reality. On a sunny day if you’re in the beaches area you might see him flying around on his OneWheel or paddleboarding on the lake.

Stephan Leroux is a Staff Developer on Shopify's AR/VR team investigating the intersection of commerce and 3D. He has been at Shopify for 3 years working on bringing 3D experiences to the platform through product and prototypes.

Additional Information

three.js

OpenGL


We're planning to DOUBLE our engineering team in 2021 by hiring 2,021 new technical roles (see what we did there?). Our platform handled record-breaking sales over BFCM, and commerce isn't slowing down. Help us scale & make commerce better for everyone


Capacity Planning at Scale


By Kathryn Tang and Kir Shatrov

The fourth Thursday in November is Thanksgiving in the United States. The day after, Black Friday (coined in 1961), is the first day of the Christmas shopping season and since 2005 it’s the busiest shopping day of the year in North America. Cyber Monday is a more recent development. Getting its name in 2005, it refers to the Monday after the Thanksgiving weekend where retailers focus on sales offered online. At Shopify, we call the weekend including Black Friday and Cyber Monday BFCM.

From the engineering team’s point of view, every BFCM challenges the platform and all the things we’ve shipped throughout the year:

  • Would our clusters handle two times the number of virtual machines? 
  • Would we hit some sort of limitation on the new network design? 
  • Would the new logging pipeline handle such an increase in traffic? 
  • What’s going to be the next scalability bottleneck that we hit?

The other challenge is planning the capacity. We need to understand the magnitude of traffic ahead of us, and how many resources like CPUs and storage we’ll need to handle BFCM sales. On top of that, we need enough headroom in case something unexpected happens or we need to perform a regional failover.

Since 2017, we’ve partnered with Google Cloud Platform (GCP) as our main vendor for the cloud. Over these years, we’ve worked closely with their team on our capacity models, and prior to every BFCM that collaboration gets even closer.

In this post, we’ll cover our approaches to capacity planning, and how we rolled it out across the org and to dozens of teams. We’ll also share how we validated our capacity plans with scalability tests to make sure they work.

Capacity Planning 

Our Google Cloud resource needs depend on how much traffic our merchants see during BFCM. We worked with our data scientists to forecast traffic levels and set those levels as a bar for our platform to scale to. Additionally, we looked into historical numbers, applied a safety margin, and projected how many buyers would check out or view online stores.

A list of GCP projects for resource planning.  The list includes items like memcache, Kafka, MySQL, etc

We created a master resourcing plan for our Google Cloud implementation and estimated how things like CPUs and storage would scale to BFCM traffic levels. Owners for our top 10 or so resource areas were tasked to estimate what they needed for BFCM. These estimates were detailed breakdowns of the machine types, geographic locations, and quantities of resources like CPUs. We also added buffers to our overall estimates to allow flexibility to change our resourcing needs, move machines across projects, or fail over traffic to different regions if we needed to. It also helps that we partition each component into a separate GCP project, which makes it much easier to reason about quotas per project.

A line graph showing the BFCM traffic forecasts over time. 4 different scenarios are shown in blue, red, yellow and green. The black line shows existing traffic patterns

2020 is an exceptionally difficult year to plan for. Normally, we’d look at BFCM trends from years prior and predict BFCM traffic with a fairly high level of confidence. This year, COVID-19 lockdowns drove a rapid shift to selling online this spring, and we didn't know what to expect. Would we see a massive increase in online traffic this BFCM, or a global economic depression where consumers stopped buying much at all? To manage heightened uncertainty, we forecasted multiple scenarios and their respective needs for our cloud deployment.

From an investment perspective, planning for the largest scale scenario means spending a lot of money very quickly to handle sales that might not happen. Alternatively, not deploying enough machines means having too little computing power and putting our merchant storefronts at risk of outages. It was absolutely vital to avoid anything that would put our merchants at risk of downtime. We decided to scale to our more aggressive growth scenarios to ensure our platform is stable regardless of what happens. We’re transparent with our partners, finance teams, and internal teams about how we thought through these scenarios which helps them make their own operating decisions.
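The arithmetic behind this kind of plan can be sketched in a few lines of Python. Everything here is illustrative rather than our actual model: the function names, the scenario multipliers, and the CPU counts are made up, and scaling usage linearly with traffic is a simplifying assumption.

```python
import math
from typing import Dict


def cpus_needed(current_cpus: int, traffic_multiplier: float, buffer: float) -> int:
    """Scale current usage with forecast traffic, then add a safety buffer.

    Assumes usage grows linearly with traffic, which real components
    rarely do exactly.
    """
    if traffic_multiplier <= 0 or buffer < 0:
        raise ValueError("traffic_multiplier must be > 0 and buffer >= 0")
    return math.ceil(current_cpus * traffic_multiplier * (1 + buffer))


def plan_for_scenario(current: Dict[str, int], multiplier: float, buffer: float) -> Dict[str, int]:
    """One estimate per GCP project (one project per component)."""
    return {project: cpus_needed(cpus, multiplier, buffer) for project, cpus in current.items()}


def plan_all(current: Dict[str, int], scenarios: Dict[str, float], buffer: float) -> Dict[str, Dict[str, int]]:
    """Build a plan for every forecast scenario so they can be compared."""
    return {name: plan_for_scenario(current, m, buffer) for name, m in scenarios.items()}


def most_aggressive(scenarios: Dict[str, float]) -> str:
    """Pick the scenario with the highest traffic multiplier."""
    return max(scenarios, key=scenarios.get)
```

Calling `plan_all` with a few named scenarios and then reading the plan for `most_aggressive(scenarios)` mirrors the decision described above: compare the options, then provision for the largest one.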

Scalability Testing

A sheet with a capacity plan is just a starting point. Once we start scaling to projected numbers, there’s a high chance that we’ll hit limits throughout our tech stack that need resolving. In a complex system, there’s always a limit like:

  • the number of VMs in a network
  • the number of packets that a busy Memcached server can accept 
  • the number of MB/s your logging pipeline can handle.

Historically, every BFCM brought us some scalability surprises, and what’s worse, we’d only notice them when fully scaled prior to BFCM. That left too little time to come up with mitigation plans.

Back in 2018, we decided that a “faux” BFCM in the middle of the year would increase our resilience as an organization and push us to find unknowns that we’d otherwise only discover during the real thing. As we started doing that, it allowed us to find problems at scale more often and created that mental muscle of preparing for critical events and finding unknowns. If you’re exercising and something feels hard, you train more and eventually your muscles get better. Shopify treats BFCM the same way.

We’ve started the practice of regular scale-up testing at Shopify, and of course we made sure to come up with fun names for each. We’ve had Mayday (2019), Spooky scale-up (2019), and Oktoberfest scale-up (2020). Another fun fact is that our Waterloo teams play a large part in running this testing, and the dates of our Oktoberfest matched the city of Kitchener-Waterloo’s Oktoberfest festivities (It’s the second-largest Oktoberfest in the world).

Oktoberfest scale-up’s goal was to simulate this year’s expected BFCM load based on the traffic forecasts from the data science team. Because we run Shopify in the cloud on Google Kubernetes Engine, we could grab extra compute capacity just for the window of the exercise and pay only for the hours we needed it.

Investment in our internal load testing tooling over the years is fundamental to our ability to run such large scale, platform-wide load tests. We’ve talked about go-lua, an open source project that powers our load testing tool. Thanks to embedded Lua, we feed it with a high-level set of steps for what we want to test: actions like browsing the storefront, adding a product to a cart, proceeding to check out, and processing the transaction through a mock payment gateway.

Thanks to Oktoberfest scale-up, we identified and then fixed some bottlenecks that could have become an issue for the real BFCM. Doing the test in early October gave us time to address issues.

After addressing all the issues, we repeated the scale-up test to see how our mitigations helped. Seeing that going smoothly increased our confidence levels about the upcoming Black Friday and reduced stress levels for all teams.

We strive for a smooth BFCM and spend a lot of time preparing for it, from capacity planning, to setting the expectations for our vendors, to load testing, and failover simulations. Beyond delivering a smooth holiday season for our merchants, BFCM is time to reflect on the future. As Shopify continues to grow, BFCM traffic levels can become the normal everyday loads we see in the next year. Our job is to bring lessons from events like BFCM to make our systems even more automated, more dynamic, and more resilient. We relish this opportunity to think about where Shopify is going and to architect our platform to scale with it.

Kir Shatrov is an Engineering Lead who’s been with Production Engineering at Shopify for the past five years, working on areas from CI/CD infrastructure to sharding and capacity planning.

Kathryn Tang is an Engineering Program Manager who manages our Google Cloud relationship. She has been at Shopify for 4 years, working with a multitude of R&D and commercial teams to derive business insights and guide operating decisions to help us scale.



Pummelling the Platform–Performance Testing Shopify


Developing a product or service at Shopify requires care and consideration. When we deploy new code at Shopify, it’s immediately available for merchants and their customers. When over 1 million merchants rely on Shopify for a successful Black Friday Cyber Monday (BFCM), it’s extremely important that all merchants—big and small—can run sales events without any surprises.

We typically classify a “sales event” as any single event that attracts a high amount of buyers to Shopify merchants. This could be a product launch, a promotion, or an event like BFCM. In fact, BFCM 2020 was the largest single sales event that Shopify has ever seen, and many of the largest merchants on the planet also saw some of the biggest flash sales ever observed before on Earth.

In order to ensure that all sales are successful, we regularly and repeatedly simulate large sales internally before they happen for real. We proactively identify and eliminate issues and bottlenecks in our systems using simulated customer traffic on representative test shops. We do this dozens of times per day using a few tools and internal processes.

I’ll give you some insight into the tools we use to raise confidence in our ability to serve large sales events. I’ll also cover our experimentation and regression framework we built to ensure that we’re getting better, week-over-week, at handling load.

We use “performance testing” as an umbrella term that covers different types of high-traffic testing including (but not limited to) two types of testing that happen regularly at Shopify: “load testing” and “stress testing”. 

Load testing verifies that a service under load can withstand a known level of traffic or specific number of requests. An example load test is when a team wants to confirm that their service can handle 1 million requests per minute for a sustained duration of 15 minutes. The load test will confirm (or disconfirm) this hypothesis.

Stress testing, on the other hand, is when we want to understand the upper limit of a particular service. We do this by increasing the amount of load on the service being tested, sometimes very quickly, until it crumbles under pressure. This gives us a good indication of how far individual services at Shopify can be pushed, in general.
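The difference between the two kinds of test is easy to see in code. The Python sketch below is purely illustrative: a "service" is modelled as a function that reports whether it stayed healthy for one minute at a given request rate, and `fake_service` is a hypothetical stand-in with a fixed capacity.

```python
from typing import Callable

# A "service" is any callable that takes a requests-per-minute level and
# returns True if it stayed healthy for that minute.
Service = Callable[[int], bool]


def load_test(service: Service, rpm: int, minutes: int) -> bool:
    """Confirm (or disconfirm) that the service sustains `rpm` for `minutes`."""
    return all(service(rpm) for _ in range(minutes))


def stress_test(service: Service, start_rpm: int, step_rpm: int, max_rpm: int) -> int:
    """Ramp load until the service fails; return the last healthy level (0 if none)."""
    last_healthy = 0
    rpm = start_rpm
    while rpm <= max_rpm:
        if not service(rpm):
            break
        last_healthy = rpm
        rpm += step_rpm
    return last_healthy


def fake_service(capacity_rpm: int) -> Service:
    """Hypothetical stand-in that fails above a fixed capacity."""
    return lambda rpm: rpm <= capacity_rpm
```

A load test answers a yes or no question about a known target, such as 1 million requests per minute for 15 minutes. A stress test returns a number: the highest level the service survived before failing.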

We condition our platform through performance testing on a massive scale to ensure that all components of Shopify’s platform can withstand the rush of customers trying to purchase during sales events like BFCM. Through proactive load tests and stress tests, we have a really good picture of what to expect even before a flash sale kicks off. 

Enabling Performance Testing at Scale

A platform as big and complex as Shopify has many moving parts, and each component needs to be finely tuned and prepared for large sales events. Not unlike a sports car, each individual part needs to be tested under load repeatedly to understand performance and throughput capabilities before assembling all the parts together and taking the entire system for a test drive. 

Individual teams creating services at Shopify are responsible for their own performance testing on the services they build. These teams are best positioned to understand the inner workings of the services they own and potential bottlenecks or situations that may be overwhelmed under extreme load, like during a flash sale. To enable these teams, performance testing needs to be approachable, easy to use and understand, and well-supported across Shopify. The team I lead is called Platform Conditioning, and our mission is to constantly improve the tooling, process, and culture around performance testing at Shopify. Our team makes all aspects of the Shopify platform stronger by simulating large sales events and making high-load events a common and regular occurrence for all developers. Think of Platform Conditioning as the personal trainers of Shopify. It’s Platform Conditioning that can help teams develop individualized workout programs and set goals. We also provide teams with the tools they need in order to become stronger.

Generating Realistic Load

At the heart of all our performance testing, we create “load”. We add load to a service at Shopify to cause stresses that, in the end, make it stronger, using requests that hit specific endpoints of the app or service.

Not all requests are equal though, and so stress testing and load testing are never as easy as tracking the sheer volume of requests received by a service. It’s wise to hit a variety of realistic endpoints when testing. Some requests may hit CDNs or layers of caching that make the response very lightweight to generate. Other requests, however, can be extremely costly and include multiple database writes, N+1 queries, or other buried treasures. It’s these goodies that we want to find and mitigate up front, before a sales event like BFCM 2020.

For example, a request to a static CSS file is served from a CDN node in 40ms without creating any load to our internal network. Comparatively, making a search query on a shop hits three different layers of caching and queries Redis, MySQL, and Elasticsearch with total round-trip time taking 1.5 seconds or longer.

Another important factor to generating load is considering the shape of the traffic as it appears at our load balancers. A typical flash sale is extremely spiky and can begin with a rush of customers all trying to purchase a limited product simultaneously. It’s very important to simulate this same traffic shape when generating load and to run our tests for the same duration that we would see in the wild.
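As a rough illustration of what a spiky traffic shape means, the Python sketch below builds a per-minute request profile with a flat baseline, a sudden jump when the sale opens, and an exponential decay afterwards. The function name, its parameters, and the choice of exponential decay are assumptions made for this example, not a description of our load generator.

```python
import math
from typing import List


def flash_sale_profile(
    baseline_rpm: int,
    peak_rpm: int,
    start_minute: int,
    decay_minutes: float,
    total_minutes: int,
) -> List[int]:
    """Return target requests per minute for each minute of a simulated sale."""
    profile = []
    for minute in range(total_minutes):
        if minute < start_minute:
            rpm = float(baseline_rpm)  # quiet period before the sale opens
        else:
            elapsed = minute - start_minute
            # Jump straight to the peak, then decay back toward the baseline.
            extra = (peak_rpm - baseline_rpm) * math.exp(-elapsed / decay_minutes)
            rpm = baseline_rpm + extra
        profile.append(round(rpm))
    return profile
```

Feeding a profile like this to a load generator, instead of a constant rate, exercises the sudden ramp that a real flash sale produces at the load balancers.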

A systems diagram showing how we generate load with go-lua

When generating load we use a homegrown, internal tool that generates raw requests to other services and receives responses from them. There are two main pieces to this tool: the first is the coordinator, and the second is the group of workers that generate the load. Our load generator is written in Go and executes small scripts written in Lua called “flows”. Each worker runs a Go binary and uses a very fast and lightweight Lua VM for executing the flows. (The Go-Lua VM that we use is open source and can be found on GitHub.) Through this, the steps of a flow can scale to issue tens of millions of requests per minute or more. This technique stresses (or overwhelms) specific endpoints of Shopify and allows us to conduct formal tests using the generated load.

We use our internal ChatOps tool, ‘Spy’, to enqueue tests directly from Slack, so everyone can quickly see when a load test has kicked off and is running. Spy will take care of issuing a request to the load generator and starting a new test. When a test is complete, some handy links to dashboards, logs, and overall results of the test are posted back in Slack.

Here’s a snippet of a flow, written in Lua, that browses a Shopify storefront and logs into a customer account—simulating a real buyer visiting a Shopify store:

Just like a web browser, when a flow is executing it sends and receives headers, makes requests, receives responses, and simulates many browser actions like storing cookies for later requests. However, an executing flow won’t automatically make subsequent requests for assets on a page and can’t execute JavaScript returned by the server. So our load tests don’t make any XMLHttpRequest (XHR) requests, JavaScript redirects, or reloads that can happen in a full web browser.

So our basic load generator is extremely powerful for generating a great deal of load, but in its purest form it can only hit very specific endpoints as defined by the author of a flow. What we create as “browsing sessions” are only a streamlined series of instructions and only include a few specific requests for each page. We want all our performance testing to be as realistic as possible, simulating real user behaviour in simulated sales events and generating all the same requests that actual browsers make. To accomplish this, we needed to bridge the gap between scripted load generation and the realistic functionality provided by real web browsers.

Simulating Reality with HAR-based Load Testing

Our first attempt at simulating real customers and adding realism to our load tests was an excellent idea, but fairly naive when it came to how much computing power it would require. We spent a few weeks exploring browser-based load testing. We researched tools that were already available and created our own using headless browsers and tools like Puppeteer. My team succeeded in making realistic browsing sessions, but unfortunately the overhead of using real browsers dramatically increased both computing costs and real money costs. With browser-based solutions, we could only drive thousands of browsing sessions at a time, and Shopify needs something that can scale to tens of millions of sessions. Browsers provide a lot of functionality, but they come with a lot of overhead. 

After realizing that browser-based load generation didn’t suit our needs, my team pivoted. We were still driving to add more realism to our load tests, and we wanted to make all the same requests that a browser would. If you open up your browser’s Developer Tools, and look at the Network tab while you browse, you see hundreds of requests made on nearly every page you visit. This was the inspiration for how we came up with a way to test using HTTP Archive (HAR) files as a solution to our problems.

An image of chrome developer tools showing a small sample of requests made by a single product page

HAR files are detailed JSON representations of all of the network requests and responses made by most popular browsers. You can export HAR files easily from your browser or from web proxy tools like Charles Proxy. A single HAR file includes all of the requests made during a browsing session, and HAR files are easy to save, examine, and share. We leveraged this concept and created a HAR-based load testing solution. We even gave it a tongue-in-cheek name: Hardy Har Har.

Hardy Har Har (or simply HHH for those who enjoy brevity) bridges the gap between simple, lightweight scripted load tests and full-fledged, browser-based load testing. HHH will take a HAR file as input and extract all of the requests independently, giving the test author the ability to pick and choose which hostnames can be targeted by their load test. For example, we nearly always remove requests to external hostnames like Google Analytics and requests to static assets on CDN endpoints (They only add complexity to our flows and don’t require load testing). The resulting output of HHH is a load testing flow, written in Lua and committed into our load testing repository in Git. Now—literally at the click of a button—we can replay any browsing session in its full completeness. We can watch all the same requests made by our browser, scaled up to millions of sessions.
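To make the extraction step concrete, here is a minimal Python sketch of reading a HAR file and keeping only requests to chosen hostnames. It is not HHH itself: the function name `extract_requests` and the shape of the returned steps are hypothetical, and the sketch stops before generating any Lua. The `log.entries[].request` layout with `method` and `url` fields is part of the standard HAR format.

```python
import json
from typing import Dict, List, Set
from urllib.parse import urlsplit


def extract_requests(har_text: str, allowed_hosts: Set[str]) -> List[Dict[str, str]]:
    """Return the method and URL of every HAR entry whose host is allowed."""
    har = json.loads(har_text)
    steps = []
    for entry in har["log"]["entries"]:
        request = entry["request"]
        host = urlsplit(request["url"]).hostname
        # Drop third-party and CDN hosts that the test author did not select.
        if host in allowed_hosts:
            steps.append({"method": request["method"], "url": request["url"]})
    return steps
```

Filtering by an explicit allow-list, rather than a block-list, means a new third-party script on the page is excluded from the load test by default.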

Of course, there are some aspects of a browsing session that can’t be simply replayed as-is. Things like logging into customer accounts and creating unique checkouts on a Shopify store need dynamic handling. HHH recognizes these cases, swaps out the static requests, and inserts dynamic logic that performs the equivalent functionality. Everything else lives in Lua and can be ripped apart or edited manually, giving the author complete control of the behaviour of their load test.

Taking a Scientific Approach to Performance Testing

The final step to having great performance testing leading up to a sales event is clarity in observations and repeatability of experiments. At Shopify, we ship code frequently, and anyone can deploy changes to production at any point in time. Similarly, anyone can kick off a massive load test from Slack whenever they please. Given the tools we’ve created and the simplicity in using them, it’s in our best interest to ensure that performance testing follows the scientific method for experimentation.

Applying the scientific method of experimentation to performance testing

Developers are encouraged to develop a clear hypothesis relating to their product or service, perform a variety of experiments, observe the results of various experiment runs, and formulate a conclusion that relates back to their hypothesis. 

All this formality in process can be a bit of a drag when you’re developing, so the Platform Conditioning team created a framework and tool for load test experimentation called Cronograma. Cronograma is an internal Rails app making it easy for anyone to set up an experiment and track repeated runs of a performance testing experiment. 

Cronograma enforces the formal use of experiments to track both stress tests and load tests. The Experiment model has several attributes, including a hypothesis and one or more orchestrations that are coordinated load tests executed simultaneously in different magnitudes and durations. Also, each experiment has references to the Shopify stores targeted during a test and links to relevant dashboards, tracing, and logs used to make observations.

Once an experiment is defined, it can be run repeatedly. The person running an experiment (the runner) starts the experiment from Slack with a command that creates a new experiment run. Cronograma kicks off the experiment and assigns a dedicated Slack channel for the tests allowing multiple people to participate. During the running of an experiment any number of things could happen including exceptions, elevated traffic levels, and in some cases, actual people may be paged. We want to record all of these things. It’s nice to have all of the details visible in Slack, especially when working with teams that are Digital by Default. Observations can be made by anyone and comments are captured from Slack and added to a timeline for the run. Once the experiment completes, the experiment runner terminates the run and logs a conclusion based on their observations that relates back to the original hypothesis.
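The relationship between an experiment and its runs can be sketched as a small data model. Cronograma is a Rails app, so this Python version is only an illustration: the class and field names are hypothetical and simply follow the attributes described above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Orchestration:
    """One coordinated load test: which flow, how hard, and for how long."""
    flow: str
    rpm: int
    minutes: int


@dataclass
class Experiment:
    hypothesis: str
    orchestrations: List[Orchestration]
    target_shops: List[str] = field(default_factory=list)
    dashboard_links: List[str] = field(default_factory=list)


@dataclass
class ExperimentRun:
    """A single execution of an experiment, with its timeline and conclusion."""
    experiment: Experiment
    runner: str
    timeline: List[str] = field(default_factory=list)
    conclusion: Optional[str] = None

    def observe(self, author: str, note: str) -> None:
        """Anyone can add an observation while the run is still open."""
        if self.conclusion is not None:
            raise RuntimeError("run already concluded")
        self.timeline.append(f"{author}: {note}")

    def conclude(self, conclusion: str) -> None:
        """The runner closes the run with a conclusion tied to the hypothesis."""
        self.conclusion = conclusion
```

Keeping the hypothesis on the experiment and the conclusion on the run is what lets the same experiment be repeated and its outcomes compared over time.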

We also included additional fanciness in Cronograma. The tool automatically detects whether any important monitors or alerts were triggered during the experiment from internal or third-party data monitoring applications. Whenever an alert is triggered, it is logged in the timeline for the experiment. We also retrieve metrics from our data warehouse automatically and consume these data in Cronograma allowing developers to track observed metrics between runs of the same experiment. For example:

  • the response times of the requests made
  • how many 5xx errors were observed
  • how many requests per minute (RPM) were generated

All of this information is automatically captured so that running an experiment is useful and it can be compared to any other run of the experiment. It’s imperative to understand whether a service is getting better or worse over time.
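Comparing two runs of the same experiment comes down to subtracting their metrics. The Python sketch below is a hypothetical illustration: the `RunMetrics` fields are modelled on the three metrics listed above (using a 95th percentile response time as an assumed summary), and the `got_worse` rule is one possible definition of a regression, not the one Cronograma uses.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class RunMetrics:
    p95_response_ms: float  # assumed summary of response times
    errors_5xx: int
    rpm: int


def compare_runs(previous: RunMetrics, current: RunMetrics) -> Dict[str, float]:
    """Return current minus previous for each tracked metric."""
    return {
        "p95_response_ms": current.p95_response_ms - previous.p95_response_ms,
        "errors_5xx": current.errors_5xx - previous.errors_5xx,
        "rpm": current.rpm - previous.rpm,
    }


def got_worse(previous: RunMetrics, current: RunMetrics) -> bool:
    """Assumed rule: slower or more errors without any increase in load."""
    delta = compare_runs(previous, current)
    same_or_less_load = delta["rpm"] <= 0
    return same_or_less_load and (delta["p95_response_ms"] > 0 or delta["errors_5xx"] > 0)
```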

Cronograma is the home of formal performance testing experiments at Shopify. This application provides a place for all developers to conduct experiments and repeat past experiments. Hypotheses, observations, and conclusions are available for everyone to browse and compare to. All of the tools mentioned here have led to numerous performance improvements and optimizations across the platform, and they give us confidence that we can handle the next major sales event that comes our way.

The Best Things Go Unnoticed 

Our merchants count on Shopify being fast and stable for all their traffic—whether they’re making their first sale, or they’re processing tens of thousands of orders per hour. We prepare for the worst case scenarios by proactively testing the performance of our core product, services, and apps. We expose problems and fix them before they become a reality for our merchants using simulations. By building a culture of load testing across all teams at Shopify, we’re prepared to handle sales events like BFCM and flash sales. My team’s tools make performance testing approachable for every developer at Shopify, and by doing so, we create a stronger platform for all our merchants. It’s easy to go unnoticed when large sales events go smoothly. We quietly rejoice in our efforts and the realization that it’s through strength and conditioning that we make these things possible.

 

Performance Scale at Shopify

Join Chris as he chats with Anita about how Shopify conditions our platform through performance testing on a massive scale to ensure that all components of Shopify’s platform can withstand the rush of customers trying to purchase during sales events like flash sales and Black Friday Cyber Monday.  

Chris Inch is a technology leader and development manager living in Kitchener, Ontario, Canada. By day, he manages Engineering teams at Shopify, and by night, he can be found diving head first into a variety of hobbies, ranging from beekeeping to music to amateur mycology.



Vouching for Docker Images


If you were using computers in the ‘90s and the early 2000s, you probably had the experience of installing a piece of software you downloaded from the internet, only to discover that someone put some nasty into it, and now you’re dragging your computer to IT to beg them to save your data. To remedy this, software developers started “signing” their software in a way that proved both who they were and that nobody tampered with the software after they released it. Every major operating system now supports code or application signature verification, and it’s a backbone of every app store.

But what about Kubernetes? How do we know that our Docker images aren’t secret bitcoin miners, stealing traffic away from customers to make somebody rich? That’s where Binary Authorization comes in. It’s a way to apply the code signing and verification that modern systems now rely on to the cloud. Coupled with Voucher, an open source project started by my team at Shopify, we’ve created a way to prevent malicious software from being installed without making developers miserable or forcing them to learn cryptography.

Why Block Untrusted Applications?

Your personal or work computer getting compromised is a huge deal! Your personal data being stolen or your computer becoming unusable due to popup ads or background processes doing tasks you don’t know about is incredibly upsetting.

But imagine if you used a compromised service. Imagine if your email host ran in Docker containers in a cluster with a malicious service that wanted to access contents of the email databases? This isn’t just your data, but the data of everyone around you.

This is something we care about deeply at Shopify, since trust is a core component of our relationship with our merchants and our merchants’ relationships with their customers. This is why Binary Authorization has been a priority for Shopify since our move to Kubernetes.

What is Code Signing?

Code signing starts by taking a hash of your application. Hashes are made with hashing algorithms that take the contents of something (such as the binary code that makes up an application) and make a short, reproducible value that represents that version. A part of the appeal of hashing algorithms is that it takes an almost insurmountable amount of work (provided you’re using newer algorithms) to find two pieces of data that produce the same hash value.

For example, if you have a file that has the text:

Hello World

The hash representation of that (using the “sha256” hashing algorithm) is:

d2a84f4b8b650937ec8f73cd8be2c74add5a911ba64df27458ed8229da804a26

Adding an exclamation mark to the end of our file:

Hello World!

Results in a completely different hash:

03ba204e50d126e4674c005e04d82e84c21366780af1f43bd54a37816b6ab340

Once you have a hash of an application, you can run the same hashing algorithm on it to ensure that it hasn’t changed. While this is an important part of the code signing, most signing applications will automatically create a hash of what you are signing, rather than requiring you to hash and then sign the hash separately. It makes the hash creation and verification transparent to the developers and their users.
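The hashing step on its own is easy to reproduce with Python's standard `hashlib` module. This is a small sketch; the helper names `sha256_hex` and `is_unchanged` are made up for the example. Keep in mind that the digest covers every byte, so a file saved with a trailing newline hashes differently from the same text without one.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of `data` as 64 hexadecimal characters."""
    return hashlib.sha256(data).hexdigest()


def is_unchanged(data: bytes, expected_hex: str) -> bool:
    """Re-hash the data and compare it with a previously recorded digest."""
    return sha256_hex(data) == expected_hex


original = sha256_hex(b"Hello World\n")
modified = sha256_hex(b"Hello World!\n")
print(original)
print(modified)
print(is_unchanged(b"Hello World\n", original))
```

The two printed digests share nothing recognizable even though the inputs differ by a single character, which is what makes a hash a useful fingerprint for a release.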

Once the initial release is ready, the developer that’s signing the application creates a public and private key for signing it, and shares the public key with their future users. The developer then uses the private part of their signing key and the hash of the application to create a value that can be verified with the public part of the key.

For example, with Minisign, a tool for creating signatures quickly, first we create our signing key:

The public half of the key is now:

RWSs3jHbeTsmYhWlyqpDEufCe5QSGHsb1fFnglZItPwDfJ3wEZzSGyBJ

And the private half remains private, living in /Users/caj/.minisign/minisign.key.

Now, if our application was named “hello” we can create a signature with that private key:

And then your users could verify that “hello” hasn’t been tampered with by running:

Unless you’re a software developer or power user, you likely have never consciously verified a signature, but that’s what’s happening behind the scenes if you’re using a modern operating system. 

Where is Code Signing Used?

Two of the biggest users of code signing are Apple and Google. They use code signing to ensure that you don’t accidentally install software updates that have been tampered with or malicious apps from the internet. Signatures are usually verified in the background, and you only get notified if something is wrong. In Android, you can turn this off by allowing unknown apps in the phone's settings, whereas iOS requires the device be jailbroken to allow unsigned applications to be installed.

A macOS dialog window showing that Firefox is damaged and can't be opened. It gives users the option of moving it to the Trash.



In macOS, applications that are missing their developer signatures or don’t have valid signatures are blocked by the operating system, which advises users to move them to the Trash.

Most Linux package managers (such as Apt/DPKG in Debian and Ubuntu, and Pacman in Arch Linux) use code signing to ensure that you’re installing packages from the distribution maintainer, and verify those packages at install time.

Docker Hub showing a docker image created by the author.

Unfortunately, Kubernetes doesn’t have this by default. There are features that allow you to leverage code signing, but chances are you haven’t used them.

And at the end of the day, do you really trust some rando on the internet to actually give you a container that does what it says it does? Do you want to trust that for your organization? Your customers?

What is Binary Authorization?

Binary Authorization is a series of components that work together: 

  • A metadata service: a service that stores signatures and other image metadata
  • A Binary Authorization Enforcer: a service that blocks images that it can’t find valid signatures for
  • A signing service: a system that signs new images and stores those signatures in the metadata service.

Google provides the first two services for their Kubernetes servers, which Shopify uses, based on two open source projects:

  • Grafeas, a metadata service
  • Kritis, a Binary Authorization Enforcer

When using Kritis and Grafeas or the Binary Authorization feature in Google Kubernetes Engine (GKE), infrastructure developers will configure policies for their clusters, listing the keys (also referred to as attestors) that must have signed the container images before they can run.

When new resources are started in a Kubernetes cluster, the images they reference are sent to the Binary Authorization Enforcer. The Enforcer connects to the metadata service to verify the existence of valid signatures for the image in question and then compares those signatures to the policy for the cluster it runs in. If the image doesn’t have the required signatures, it’s blocked, and any containers that would use it won’t start.

You can see how these two systems work together to provide the same security that one gets in one’s operating system! However, there’s one piece that wasn’t provided by Google until recently: the signing service.
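The enforcement decision described above boils down to a set comparison: an image may run only if every attestor the cluster's policy requires has signed it. The sketch below models that logic with an in-memory stand-in for the metadata service. The class, function, image, and attestor names are all hypothetical and are not the Kritis or Grafeas APIs; signature verification itself is assumed to have happened before an attestor is recorded.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Set


@dataclass
class MetadataService:
    """In-memory stand-in for a metadata service (hypothetical, not Grafeas)."""

    signatures: Dict[str, Set[str]] = field(default_factory=dict)

    def record(self, image: str, attestor: str) -> None:
        """Remember that attestor produced a valid signature for image."""
        self.signatures.setdefault(image, set()).add(attestor)

    def attestors_for(self, image: str) -> Set[str]:
        """Return the attestors that have signed image (empty if unknown)."""
        return self.signatures.get(image, set())


def is_allowed(image: str, required: Iterable[str], metadata: MetadataService) -> bool:
    """Allow the image only if every required attestor has signed it."""
    return set(required) <= metadata.attestors_for(image)


metadata = MetadataService()
image = "registry.example/app@sha256:abc123"  # hypothetical image reference
metadata.record(image, "built-internally")

staging_policy = {"built-internally"}
compliance_policy = {"built-internally", "approved-by-reviewers"}

print(is_allowed(image, staging_policy, metadata))     # signed by all required attestors
print(is_allowed(image, compliance_policy, metadata))  # missing one required attestor
```

Keeping the policy per cluster, as in the two sets above, is what lets stricter and looser clusters share the same signatures.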

Voucher: The Missing Piece

Voucher serves as the last piece for Binary Authorization, the signing service. Voucher allows Shopify to run security checks against our Docker images and sign them depending on how secure they are, without requiring that non-security teams manage their signing keys.

Using Voucher's client software to check an image with the 'is_shopify' check, which verifies if the image was from a Shopify owned repository.

The way it works is simple:

  1. Voucher runs in Google Cloud Run or Kubernetes and is accessible as a REST endpoint
  2. Every build pipeline automatically calls to Voucher with the path to the image it built
  3. Voucher reviews the image, signs it, and pushes that signature to the metadata service

On top of the basic code signing workflow discussed previously, Voucher also supports validating more complicated requirements, using separate security checks and associated signing keys to mix and match required signatures on a per cluster basis to create distinct policies based on a cluster’s requirement.

For example, do you want to block images that weren’t built internally? Voucher has a distinct check for verifying that an image is associated with a Git commit in a GitHub repo you own, and for signing those images with a separate key.

Alternatively, do you need to be able to prove that every change was approved by multiple people? Voucher can support that, creating signatures based on the existence of approvals in GitHub (with support for other code hosting services coming soon). This would allow you to use Binary Authorization to block images that would violate that requirement.

Voucher also has support for verifying the identity of the container builder, blocking images with a high number of vulnerabilities, and so on. And Voucher was designed to be extensible, allowing for the creation of new checks as need be.

By combining Voucher’s checks and Binary Authorization policies, infrastructure teams can create a layered approach to securing their organization’s Kubernetes clusters. Compliance clusters can be configured to require approvals and block images with vulnerabilities, while clusters for experiments and staging can use less strict policies to allow developers to move faster, all with minimum work from non-security focused developers.

Voucher Joins Grafeas

As mentioned earlier, Voucher serves a need that hasn’t been provided by Google until recently. That has changed: Voucher has moved into the Grafeas organization and is now a service provided by Google to Google Kubernetes Engine users going forward.

Since our move to Kubernetes, Shopify’s security team has been working with Google’s Binary Authorization team to plan out how we’ll roll out Binary Authorization and design Voucher. We also released Voucher as an open source project in December 2018. This move to the Grafeas project simplifies things, putting it in the same place as the other open source Binary Authorization components.

Improving the security of the infrastructure we build makes everyone safer. And making Voucher a community project will put it in front of more teams which will be able to leverage it to further secure their Kubernetes clusters, and if we’re lucky, will result in a better, more powerful Voucher! Of course, Shopify’s Software Supply Chain Security team will continue our work on Voucher, and we want you to join us!

Please help us!

If you’re a developer or writer who has time and interest in helping out, please take a look at the open issues or fork the project and open a PR! We can always use more documentation changes, tutorials, and third party integrations!

And if you’re using Voucher, let us know! We’d love to hear how it’s going and how we can do a better job of making Kubernetes more secure for everyone!


Wherever you are, your next journey starts here! If building systems from the ground up to solve real-world problems interests you, visit our Engineering career page to find out about our open positions and learn about Digital by Default.

How to Build a Production Grade Workflow with SQL Modelling

By Michelle Ark and Chris Wu

In January of 2014, Shopify built a data pipeline platform for the data science team called Starscream. Back then, we were a smaller team and needed a tool that could deal with everything from ad hoc explorations to machine learning models. We chose to build with PySpark to get the power of a generalized distributed computing platform, the backing of the industry standard, and the ability to tap into the Python talent market.

Fast forward six years and our data needs have changed. Starscream now runs 76,000 jobs and writes 300 terabytes a day! As we grew, some types of work went away, but others (like simple reports) became so commonplace we do them every day. While our Python tool based on PySpark was computationally powerful, it wasn’t optimized for these commonplace tasks. If a product manager needed a simple rollup for a new feature by country, pulling the data and modelling it wasn’t a fast task.

We’ll show you how we moved to a SQL modelling workflow by leveraging dbt (data build tool) and created tooling for testing and documentation on top of it. All together, these features provide Shopify’s data scientists with a robust, production-ready workflow to quickly build straightforward pipelines.

The Problem

When we interviewed our users to understand their workflow on Starscream, there were two issues we discovered: development time and thinking.

Development time encompasses the time data scientists use to prototype the data model they’d like to build, run it, see the outcome, and iterate. The PySpark platform isn’t ideal for running straightforward reporting tasks: it often forces data scientists to write boilerplate, and it yields long runtimes. This led to long iteration cycles when trying to build models on unfamiliar data.

The second issue, thinking, is more subtle and deals with the way the programming language forces you to look at the data. Many of our data scientists prefer SQL to Python because its structure forces consistency in business metrics. When interviewing users, we found a majority would write out a query in SQL then translate it to Python when prototyping. Unfortunately, query translation is time consuming and doesn’t add value to the pipeline.

To understand how widespread these problems were, we audited the jobs run and surveyed our data science team for the use cases. We found that 70% or so of the PySpark jobs on Starscream were full batch queries that didn’t require generalized computing. We viewed this as an opportunity to make a kickass optimization for a painful workflow. 

Enter Seamster

Our goal was to create a SQL pipeline for reporting that enables data scientists to create simple reporting data faster, while still being production ready. After exploring a few alternatives, we felt that the dbt library came closest to our needs. Their tagline “deploy analytics code faster with software engineering practices” was exactly what we were looking for in a workflow. We opted to pair it with Google BigQuery as our data store and dubbed the system and its tools, Seamster.

We knew that any off-the-shelf system wouldn’t be one size fits all. In moving to dbt, we had to implement our own:

  • source and model structure to modularize data model development
  • unit testing to increase the types of testable errors
  • continuous integration (CI) pipelines to provide safety and consistency guarantees.

Source Independence and Safety

With dozens of data scientists making data models in a shared repository, a great user experience would

  • maximize focus on work 
  • minimize the impact of model changes by other data scientists.

By default, dbt declares raw sources in a central sources.yml. This quickly became a very large file as it included the schema for each source, in addition to the source name. It creates a huge bottleneck for teams editing the same file across multiple PRs. 

To mitigate the bottleneck, we leveraged the flexibility of dbt and created a top-level ‘sources’ directory to represent each raw source with its own source-formatted YAML file. This way, data scientists can parse only the source documentation that’s relevant for them and contribute source definitions without stepping on each other’s toes.

Base models are one-to-one interfaces to raw sources.

We also created a Base layer of models using the ‘staging’ concept from dbt to implement their best practice of limiting references to raw data. Our Base models serve as a one-to-one interface to raw sources. They don’t change the grain of the raw source, but do apply renaming, recasting, or any other cleaning operation that relates to the source data collection system.

The Base layer serves to protect users from breaking changes in raw sources. Raw external sources are by definition out of the control of Seamster and can introduce breaking changes for any number of reasons at any point in time. If and when this happens, you only need to apply the fix to the Base model representing the raw source, as opposed to every individual downstream model that depends on the raw source. 

Model Ownership for Teams

We knew that the tooling improvements of Seamster would be only one part of a greater data platform at Shopify. We wanted to make sure we’re providing mechanisms to support good dimensional modelling practices and support data discovery.

In dbt, a model is simply a .sql file. We’ve extended this definition in Seamster to define a model as a directory consisting of four files: 

  • model_name.sql
  • schema.yml
  • README.md
  • test_model_name.py

You can further organize models into directories that indicate a data science team at Shopify like ‘finance’ or ‘marketing’. 

To support a clean data warehouse we’ve also organized data models into these rough layers that differentiate between:

  • base: data models that are one-to-one with raw data, but cleaned, recast and renamed
  • application-ready: data that isn’t dimensionally modelled but still transformed and clean for consumption by another tool (for example,  training data for a machine learning algorithm)
  • presentation: shareable and reliable data models that follow dimensional modelling best practices and can be used by data scientists across different domains.

With these two changes, a data consumer can quickly understand the data quality they can expect from a model and find the owner in case there is an issue. We also pass this metadata upstream to other tools to help with the data discovery workflow.

More Tests

dbt has native support for ‘schema tests’, which are encoded in a model’s schema.yml file. These tests run against production data to validate data invariants, such as the presence of null values or the uniqueness of a particular key. This feature in dbt serves its purpose well, but we also want to enable data scientists to write unit tests for models that run against fixed input data (as opposed to production data).

Testing on fixed inputs allows the user to test edge cases that may not be in production yet. In larger organizations, there can and will be frequent updates and many collaborators for a single model. Unit tests give users confidence that the changes they’re making won’t break existing behaviour or introduce regressions. 

Seamster provides a Python-based unit testing framework. Data scientists write their unit tests in the test_model_name.py file in the model directory. The framework enables constructing ‘mock’ input models from fixed data. The central object in this framework is a ‘mock’ data model, which has an underlying representation of a Pandas dataframe. You can pass fixed data to the mock constructor as either a csv-style string, Pandas dataframe, or a list of dictionaries to specify input data. 

Input and expected MockModels are built from static data. The actual MockModel is built from input MockModels by BigQuery. Actual and expected MockModels can assert equality or any Great Expectations expectation.

A constructor creates a test query where a common table expression (CTE) represents each input mock data model, and any references to production models (identified using dbt’s ‘ref’ macro) are replaced by references to the corresponding CTE. Once you execute a query, you can compare the output to an expected result. In addition to an equality assertion, we extended our framework to support all expectations from the open-source Great Expectations library to provide more granular assertions and error messaging. 
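The query construction described above can be sketched as follows. This is not Seamster's implementation: the function names are hypothetical, the regular expression only handles the simplest form of dbt's ‘ref’ macro, and SQLite stands in for BigQuery so the example runs locally. The literal quoting is ANSI-style and would need adapting to the target engine.

```python
import re
import sqlite3

# Matches dbt-style references such as {{ ref('base_orders') }}.
REF_PATTERN = re.compile(r"\{\{\s*ref\(\s*['\"](\w+)['\"]\s*\)\s*\}\}")


def sql_literal(value):
    """Render a Python value as a SQL literal (ANSI-style quoting; engines differ)."""
    if value is None:
        return "NULL"
    if isinstance(value, bool):
        return "TRUE" if value else "FALSE"
    if isinstance(value, (int, float)):
        return str(value)
    return "'" + str(value).replace("'", "''") + "'"


def mock_cte(name, rows):
    """Build one CTE that yields the fixed rows of a mock input model."""
    if not rows:
        raise ValueError(f"mock for {name!r} needs at least one row")
    selects = [
        "SELECT " + ", ".join(f"{sql_literal(v)} AS {col}" for col, v in row.items())
        for row in rows
    ]
    return f"{name} AS (\n  " + "\n  UNION ALL\n  ".join(selects) + "\n)"


def build_test_query(model_sql, mocks):
    """Prefix the model SQL with mock CTEs and point each ref() at its CTE."""
    missing = set(REF_PATTERN.findall(model_sql)) - set(mocks)
    if missing:
        raise ValueError(f"no mock provided for: {sorted(missing)}")
    ctes = ",\n".join(mock_cte(name, rows) for name, rows in mocks.items())
    return f"WITH {ctes}\n" + REF_PATTERN.sub(r"\1", model_sql)


# Hypothetical model and fixed input rows (a list of dictionaries).
model_sql = (
    "SELECT country, COUNT(*) AS orders "
    "FROM {{ ref('base_orders') }} GROUP BY country"
)
mocks = {
    "base_orders": [
        {"id": 1, "country": "CA"},
        {"id": 2, "country": "CA"},
        {"id": 3, "country": "NZ"},
    ]
}
query = build_test_query(model_sql, mocks)

# SQLite stands in for the query engine; compare actual rows to the expected ones.
actual = sorted(sqlite3.connect(":memory:").execute(query).fetchall())
expected = [("CA", 2), ("NZ", 1)]
assert actual == expected
```

Because the mock data travels inside the query itself, no tables need to be created or cleaned up for a test run.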

The main downside to this framework is that it requires a roundtrip to the query engine to construct the test data model given a set of inputs. Even though the query itself is lightweight and processes only a handful of rows, these roundtrips to the engine add up. It becomes costly to run an entire test suite on each local or CI run. To solve this, we introduced tooling both in development and CI to run the minimal set of tests that could potentially break given the change. This was straightforward to implement with accuracy because of dbt’s lineage tracking support; we simply had to find all downstream models (direct and indirect) for each changed model and run their tests. 
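Selecting the minimal set of tests is a graph traversal over model lineage. The sketch below assumes the lineage is already available as a mapping from each model to its direct dependents; the model names are hypothetical, and including the changed models themselves in the result is an assumption of this example rather than something stated above.

```python
from collections import deque


def models_to_test(children, changed):
    """Return the changed models plus all direct and indirect downstream models."""
    selected = set(changed)
    queue = deque(changed)
    while queue:
        model = queue.popleft()
        for child in children.get(model, ()):
            if child not in selected:
                selected.add(child)
                queue.append(child)
    return selected


# Hypothetical lineage: each key maps to the models that directly depend on it.
children = {
    "base_orders": ["orders_by_country"],
    "orders_by_country": ["sales_report"],
    "base_customers": ["customer_summary"],
}

selected = models_to_test(children, ["base_orders"])
print(sorted(selected))
```

Models outside the affected part of the graph, such as customer_summary here, are skipped entirely.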

Schema and Directed Acyclic Graph Validation on the Cheap

Our objective in Seamster’s CI is to give data scientists peace of mind that their changes won’t introduce production errors the next time the warehouse is built. They shouldn’t have to wonder whether removing a column will cause downstream dependencies to break, or whether they made a small typo in their SQL model definition.

To achieve this accurately, we would need to build and tear down the entire warehouse on every commit. This isn’t feasible from either a time or a cost perspective. Instead, on every commit we materialize every model as a view in a temporary BigQuery dataset which is created at the start of the validation process and removed as soon as the validation finishes. If we can’t build a view because its upstream model doesn’t provide a certain column, or if the SQL is invalid for any reason, BigQuery fails to build the view and produces relevant error messaging.

Currently, we have a warehouse consisting of over 100 models, and this validation step takes about two minutes. We reduce validation time further by only building the portion of the directed acyclic graph (DAG) affected by the changed models, as done in the unit testing approach.

dbt’s schema.yml serves purely as metadata and can contain columns with invalid names or types (data_type). We employ the same view-based strategy to validate the contents of a model’s schema.yml file, ensuring the schema.yml is an accurate depiction of the actual SQL model.

Data Warehouse Rules

Like many large organizations, we maintain a data warehouse for reporting where accuracy is key. To power our independent data science teams, Seamster helps by enforcing conformance rules on the layers mentioned earlier (base, application-ready, and presentation layers). Examples include naming rules or inheritance rules which help the user reason over the data when building their own dependent models.

Seamster CI runs a collection of such rules that ensure consistency of documentation and modelling practices across different data science teams. For example, one warehouse rule enforces that all columns in a schema conform to a prescribed nomenclature. Another warehouse rule enforces that only base models can reference raw sources (via the ‘source’ macro) directly. 

Some warehouse rules apply only to certain layers. In the presentation layer, we enforce that any column name needs a globally unique description to avoid divergence of definitions. Since dbt keeps this metadata in YAML, most of this rule enforcement is just simple parsing.
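As an illustration of how little machinery such a rule needs, here is a sketch of the description rule. It reads the rule as "the same column name must carry the same description in every model", which is an interpretation, and it works on dictionaries shaped like parsed schema.yml files (a ‘models’ list whose entries have ‘columns’ with ‘name’ and ‘description’). The function name and the sample data are hypothetical, and loading the YAML is left out.

```python
from collections import defaultdict


def divergent_descriptions(schemas):
    """Map each column name to its descriptions when models disagree on it."""
    seen = defaultdict(set)
    for schema in schemas:
        for model in schema.get("models", []):
            for column in model.get("columns", []):
                seen[column["name"]].add(column.get("description", "").strip())
    return {name: sorted(descs) for name, descs in seen.items() if len(descs) > 1}


# Hypothetical parsed schema.yml contents for two presentation models.
schemas = [
    {"models": [{"name": "sales_report", "columns": [
        {"name": "shop_id", "description": "Unique identifier of the shop"},
        {"name": "country", "description": "ISO country code"},
    ]}]},
    {"models": [{"name": "orders_by_country", "columns": [
        {"name": "shop_id", "description": "ID of the merchant"},
        {"name": "country", "description": "ISO country code"},
    ]}]},
]

violations = divergent_descriptions(schemas)
print(violations)
```

A CI step could fail the build whenever the returned mapping is non-empty and print the conflicting descriptions.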

So, How Did It Go?

To ensure we got it right and worked out the kinks, we ran a multiweek beta of Seamster with some of our data scientists who tested the system out on real models. Since you’re reading about it, you can guess by now that it went well!

While productivity is always hard to measure, the vast majority of users reported they were shipping models in a couple of days instead of a couple of weeks. In addition, documentation of models increased because this is a feature built into the model spec.

Were there any negative results? Of course. dbt’s current incremental support doesn’t provide safe and consistent methods to handle late arriving data, key resolution, and rebuilds. For this reason, a handful of models (Type 2 dimensions or models in the 1.5B+ event territory) that required incremental semantics weren’t doable, for now. We’ve got big plans though!

Where to Next?

We’re focusing on updating the tool to ensure it’s tailored to Shopify’s data scientists. The biggest hurdle for a new product (internal and external) is adoption. We know we still have work to do to ensure that our tool is top of mind when users have simple (but not easy) reporting work. We’re spending time with each team to identify upcoming work that we can speed up by using Seamster. Their questions and comments will be part of our tutorials and documentation for new data scientists.

On the engineering front, an exciting next step is looking beyond batch data processing. Apache Beam and Beam SQL provide an opportunity to consider a single SQL-centric data modelling tool for both batch and streaming use cases.

We’re also big believers in open source at Shopify. Depending on the dbt community’s needs, we’d also like to explore contributing our validation strategy and unit testing framework to the project.


If you’re interested in building solutions from the ground up and would like to come work with us, please check out Shopify’s career page.

Adopting Sorbet at Scale

On November 25, 2020 we held ShipIt! Presents: The State of Ruby Static Typing at Shopify. The video of the event is now available.

Shopify changes a lot. We merge around 400 commits to the main branch daily and deploy a new version of our monolith 40 times a day. Shopify is also big: 37,000 Ruby files, 622,000 methods, more than 2,000,000 calls. At this scale, with a dynamic language, even with the most rigorous review process and over 150,000 tests, it’s a challenge to ensure that everything runs smoothly. Developers benefit from a short feedback loop to ensure the stability of our monolith for our merchants.

In my first post, I talked about how we brought static typing to our core monolith. We adopted Sorbet in 2019, and the Ruby Infrastructure team continues to work on ways to make the development process safer, faster, and more enjoyable for Ruby developers. Currently, Sorbet is only enforced on our main monolith, but we have 60 internal projects using Sorbet as well. On our main monolith, we require all files to be at least typed: false and Sorbet is run on our continuous integration (CI) platform for every pull request and fails builds if type checking errors are found. As of today, 80% of our files (including tests) are typed: true or higher. Almost half of our calls are typed and half of our methods have signatures.

In this second post, I’ll present how we got from no Sorbet in our monolith to almost full coverage in the span of a few months. I’ll explain the challenges we faced, the tools we built to solve them, and the preliminary results of our experiment to reduce production errors with static typing.

Our Open-Source Tooling for Sorbet Adoption

Currently, Sorbet can’t understand all the constructs available in Ruby. Furthermore, Shopify relies on a lot of gems and frameworks, including Rails, that bring their own set of idioms. Increasing type coverage in our monolith meant finding ways to make Sorbet understand all of this. These are the tools we created to make it possible. They are open sourced in our effort to share our work with the community and make typing adoption easier for everyone.

Making Code Sorbet-compatible with RuboCop Sorbet

Even with gradual typing, moving our monolith to Sorbet required a lot of changes to remove or replace Ruby constructs that Sorbet couldn’t understand, such as non-constant superclasses or accessing constants through meta-programming with const_get. For this, we created RuboCop Sorbet, a suite of RuboCop rules allowing us to:

  • forbid some of the constructs not recognized by Sorbet yet 
  • automatically correct those constructs to something Sorbet can understand.

We also use these cops to require a minimal typed level on all files of our monolith (at least typed: false for now, but soon typed: true) and enforce some styling conventions around the way we write signatures.

Creating RBI Files for Gems with Tapioca

One considerable piece missing when we started using Sorbet was Ruby Interface file (RBI) generation for gems. For Sorbet to understand code from required gems, we had two options: 

  1. pass the full code of the gems to Sorbet which would make it slower and require making all gems compatible with Sorbet too
  2. pass a light representation of the gem content through a Ruby Interface file, called an RBI file.

Since this was before the birth of Sorbet’s srb tool, we created our own: Tapioca. Tapioca provides an automated way to generate the appropriate RBI file for a given gem with high accuracy. It generates the definitions for all statically defined types and most of the runtime defined ones exported from a Ruby gem. It loads all the gems declared in the dependency list from the Gemfile into memory, then performs runtime introspection on the loaded types to understand their structure, and finally generates a complete RBI file for each gem with a versioned filename.

Tapioca is the de facto RBI generation tool at Shopify and is used by a few renowned projects, including Homebrew.

Creating RBI Files for Domain Specific Languages

Understanding the content of the gems wasn’t enough to allow type checking our monolith. At Shopify we use a lot of internal Domain Specific Languages (DSLs), most of them coming directly from Rails and often based on meta-programming. For example, the Active Record association belongs_to ends up defining tens of methods at runtime, none of which are statically visible to Sorbet. To enhance Sorbet coverage on our codebase we needed it to “see” those methods.

To solve this problem, we added RBI generation for Rails DSLs directly into Tapioca. Again, using runtime introspection, Tapioca analyzes the code of our application to generate RBI files containing a static definition for all the runtime-generated methods from Rails and other libraries.

Today Tapioca provides RBI generation for a lot of DSLs we use at Shopify:

  • Active Record associations
  • Active Record columns
  • Active Record enums
  • Active Record scopes
  • Active Record typed store
  • Action Mailer
  • Active Resource
  • Action Controller helpers
  • Active Support current attributes
  • Rails URL helpers
  • FrozenRecord
  • IdentityCache
  • Google Protobuf definitions
  • SmartProperties
  • StateMachines
  • …and the list is growing every day

Building Tooling on Top of Sorbet with Spoom

As we began using Sorbet, the need for tooling built on top of it was more and more apparent. For example, Tapioca itself depends on Sorbet to list the symbols for which we need to generate RBI definitions.

Sorbet is a really fast Ruby parser that can build an Abstract Syntax Tree (AST) of our monolith in a matter of seconds versus a few minutes for the Whitequark parser. We believe that in the future a lot of tools such as linters, cops, or static analyzers can benefit from this speed.

Sorbet also provides a Language Server Protocol (LSP) mode with the option --lsp. Using this option, Sorbet can act as a server that is interrogated by other tools programmatically. LSP scales much better than using the file output by Sorbet with the --print option (see for example parse-tree-json or symbol-table-json), which spits out GBs of JSON for our monolith. Using LSP, we get answers in a few milliseconds instead of parsing those gigantic JSON files. This is generally how the language plugins for IDEs are implemented.

To facilitate the development of external tools to Sorbet we created Spoom, our toolbox to use Sorbet programmatically. It provides a set of useful features to interact with Sorbet, parse the configuration files, list the type checked files, collect metrics, or automatically bump files to higher strictnesses and comes with a Ruby API to connect with Sorbet’s LSP mode.

Today, Spoom is at the heart of our typing coverage reporting and provides the beautiful visualizations used in our SorbetMetrics dashboard.

Sharing Lessons Learned

After more than a year using Sorbet on our codebases, we learned a lot. I’ll share some insights about what typing did for us, which benefits it brings, and some of the limitations it implies.

Build, Measure, Learn

There’s a very scientific way to approach building products, encapsulated in the Build-Measure-Learn loop pioneered by Eric Ries. Our team believes in intuition, but we still prefer empirical proofs when we have access to them. So when we started with static typing in Ruby, we all believed it would be useful for our developers, but wanted to measure its effects and have hard data. This allows us to decide what we should concentrate on next based on the outcome of our measurements.

I talked about observing metrics, surveying developer happiness, or getting feedback through interviews in part 1, but my team wanted to go further and correlate the impact of typing on production errors. So, we conducted a series of controlled experiments to validate our assumptions.

Since our monolith evolves very fast, it becomes hard to observe the direct impact of typing on production. New features are added every day which gives rise to new errors while we work to decrease errors in other areas. Moreover, our monolith has about 500 gem dependencies (including transitive dependencies), any of which could introduce new errors in a version bump.

For this reason, we decreased our scope and targeted a smaller codebase for our experiment. Our internal developer tool, aptly named dev, was an ideal candidate. It’s a mature codebase that changes slowly and is touched by only a few people. It’s a very opinionated codebase with no external dependencies (the few dependencies it has are vendored), so it could satisfy the performance requirements of a command-line tool. Additionally, dev uses almost no meta-programming, especially not the kind normally coming from external libraries. Finally, it’s a tool with heavy usage since it’s the main development tool used by all developers at Shopify for their day-to-day work. It runs thousands of times a day on hundreds of different computers, so there’s no such thing as an edge case: at this scale, if something can break, it will.

We started monitoring all errors raised by dev in production, categorized the errors, analyzed their root cause, and tried to understand how typing could avoid them.

typed: ignore means typed: debt

Our first realisation was to keep away from typed: ignore. Ignoring a file can cause errors to appear in other files because those other files may reference something defined in the ignored file.

For example, if we opt to ignore this file:

Sorbet will raise errors in this file:

Since Sorbet doesn't even parse the file a.rb, it won’t know where constant A was defined. The more files you ignore, the more this case arises, especially when ignoring library files. This makes it harder and harder for other developers to type their own code.

As a rule of thumb at Shopify, we aim to have all our application files at least at typed: true and our test files at least at typed: false. We reserve typed: ignore for some test files that are particularly hard to type (because of mocking, stubbing, and fixtures), or some very specific files such as Protobuf definition files (which we handle through DSL RBI generation with Tapioca).

Benefits Realized, Even at typed: false

Even at typed: false, Sorbet provides safety in our codebase by checking that all the constants resolve. Thanks to this, we now avoid mistakes triggering NameErrors either in CI or production.

Enabling Sorbet on our monolith allowed us to find and fix a few mistakes such as:

  • StandardException instead of StandardError
  • NotImplemented instead of NotImplementedError

We found dead code referencing constants deleted months ago. Interestingly, while most of the main execution paths were covered by tests, code paths for error handling were the places where we found the most NameErrors.

A bar graph showing the decreasing amount of NameErrors in dev over time
NameErrors raised in production for the dev project

During our experiment, we started by moving all files from dev to typed: false without adding any signatures. As soon as Sorbet was enabled in October 2019 on this project, no more NameErrors were raised in production.

Stacktrace showing NameError raised in production after Sorbet was enabled because of meta-programming like const_get
NameError raised in production after Sorbet was enabled because of meta-programming

The same observation was made on multiple projects: enabling Sorbet on a codebase eradicates all NameErrors due to developers’ mistakes. Note that this doesn’t avoid NameErrors triggered through metaprogramming, for example, when using const_get.

While Sorbet is a bit more restrictive when it comes to resolving constants, this strictness can be beneficial for developers:

Example of constant resolution error raised by Sorbet

typed: true Brings More Benefits

A circular tree map showing the relationship between strictness level and helpers in dev
File strictness levels in dev (the colored dots are the helpers)

With our next experiment on gradual typing, we wanted to observe the effects of moving parts of the dev application to typed: true. We moved a few of the typed: false files to typed: true by focusing on the most reused part of the application, called helpers (the blue dots).

A bar graph showing the decrease in NoMethodErrors for files typed: true over time
NoMethodErrors in production for dev (in red the errors raised from the helpers)

By typing only this part of the application (~20% of the files) and still without signatures, we observed a decrease in NoMethodErrors for files typed: true.

Those preliminary results gave us confidence that stricter typing can impact other classes of errors. We’re now in the process of adding signatures to the typed: true files in dev so we can observe their effect on TypeErrors and ArgumentErrors in production.

The Road Ahead of Us

The team working on Sorbet adoption is part of the broader Ruby Infrastructure team which is responsible for delivering a fast, scalable, and safe Ruby language for Shopify. As part of that mandate, we believe there are more things we can do to increase Ruby performance when the types of variables are known ahead of time. This is an interesting area to explore for the team as our adoption of typing increases and we’re certainly thinking about investing in this in the near future.

Support for Ruby 2.7

We keep our monolith as close as possible to Ruby trunk. This means we moved to Ruby 2.7 months ago. Doing so required a lot of changes in Sorbet itself to support syntax changes such as beginless ranges and numbered parameters, as well as new behaviors like forbidding circular argument references. Some work is still in progress to support the new forwarding arguments syntax and pattern matching. Stripe is currently working on making the keyword arguments compatible with Ruby 2.7 behavior.
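For readers less familiar with the Ruby 2.7 additions named above, the following sketch shows each syntax in isolation. The method and variable names are made up for the example, and it says nothing about how Sorbet handles these constructs internally. Note that pattern matching was still experimental in Ruby 2.7 and prints a warning there.

```ruby
# Beginless range: everything up to and including 10.
cheap = (..10)
puts cheap.cover?(7)

# Numbered block parameters: _1 is the first block argument.
doubled = [1, 2, 3].map { _1 * 2 }

# Argument forwarding: `...` passes every argument through unchanged.
def log(level, message)
  "[#{level}] #{message}"
end

def forward_log(...)
  log(...)
end

# Pattern matching: destructure a hash and bind a value in one step.
order = { status: "shipped", items: 3 }
summary =
  case order
  in { status: "shipped", items: Integer => count }
    "shipped #{count} items"
  else
    "pending"
  end

puts doubled.inspect
puts forward_log(:warn, "disk almost full")
puts summary
```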

100% Files at typed: true

The next objective for our monolith is to move 100% of our files to at least typed: true and make it mandatory for all new files going forward. Doing so implies making both Sorbet and our tooling smarter to handle the last Ruby idioms and constructs we can’t type check yet.

We’re currently focusing on changing Sorbet to support Rails ActiveSupport::Concerns (Sorbet pull requests #3424, #3468, #3486) and providing a solution to inclusion requirements. We’re also improving Tapioca for better RBI generation for generics and GraphQL support.

Investing in the Future of Static Typing for Ruby

Sorbet isn’t the end of the road concerning Ruby type checking. Matz announced that Ruby 3 will ship with RBS, another approach to type check Ruby programs. While the solution isn’t yet mature enough for our needs (mainly because of speed and some other limitations explained by Stripe) we’re already collaborating with Ruby core developers, academics, and Stripe to make its specification better for everyone.

Notably, we have open-sourced RBS parser, a C++ parser for RBS capable of translating a subset of RBS to RBI, and are now working on making RBS partially compatible with Sorbet.

We believe that typing, whatever the solution used, is greatly beneficial for Ruby, Shopify, and our merchants, so we'll continue to invest heavily in it. We want to increase typing across the whole community.

We’ll continue to work with collaborators to push typing in Ruby even further. As we lessen the effort needed to adopt Sorbet, specifically on Rails projects, we’ll start making gradual typing mandatory on more internal projects. We will help teams start adopting at typed: false and move to stricter typing gradually.

As our long term goal, we hope to bring Ruby on par with compiled and statically typed languages regarding safety, speed and tooling.

Do you want to be part of this effort? Feel free to contribute to Sorbet (there are a lot of good first issues to begin with), check our many open-source projects or take a look at how you can join our team.

Happy typing!

—The Ruby Infrastructure Team



Static Typing for Ruby


On November 25, 2020 we held ShipIt! Presents: The State of Ruby Static Typing at Shopify. The video of the event is now available.

Shopify changes a lot. We merge around 400 commits to the main branch daily and deploy a new version of our core monolith 40 times a day. The monolith is also big: 37,000 Ruby files, 622,000 methods, more than 2,000,000 calls. At this scale with a dynamic language, even with the most rigorous review process and over 150,000 automated tests, it’s a challenge to ensure everything runs smoothly. Developers benefit from a short feedback loop to ensure the stability of our monolith for our merchants.

Since 2018, our Ruby Infrastructure team has looked at ways to make the development process safer, faster, and more enjoyable for Ruby developers. While Ruby is different from other languages and brings amazing features allowing Shopify to be what it is today, we felt there was a feature from other languages missing: static typing.

The Three Key Requirements for a Typing Solution in Ruby

Even in 2018, typing for Ruby wasn't a novelty. A few attempts were made to integrate type annotations directly into the language or through external tools (RDL, Steep), or as libraries (dry-types). 

Which solution would best fit Shopify considering its codebase and culture? For the Ruby Infrastructure team, the best match for a typing solution needs:

  • Gradual typing: Typing a monolith isn't a simple task and can’t be done in a day. Our code evolves fast, and we can’t ask developers to stop coding while we add types to the existing codebase. We need flexibility to add types without blocking the development process or limiting our ability to satisfy merchants’ needs.
  • Speed: Considering the size of Shopify’s codebase, speed is a concern. If our goal is to provide quick feedback on errors and remove pressure from continuous integration (CI), we need a solution that’s fast.
  • Full Ruby support: We use all of Ruby at Shopify. Our culture embraces the language and benefits from all features, even hard-to-type ones like metaprogramming, overloading, and class reopening. Support for Rails is also a must. From code elegance to developer happiness, the chosen solution needs to be as compatible as possible with all Ruby features.

With such a list of requirements, none of the contenders at the time could satisfy our needs, especially the speed requirement. We started thinking about developing our own solution, but a perfectly timed meeting with Stripe, who were working on a solution to the problem, introduced us to Sorbet.

Sorbet was closed-source at the time and under heavy development but was already promising. It’s built for gradual typing with phenomenal performance (able to analyze 100,000 lines per second per core), making it significantly faster than running automated tests. It can handle hard-to-type things like metaprogramming, thanks to Ruby Interface files (RBI). This is how, at the start of 2019, Shopify began its journey toward static type checking for Ruby.

Treat Static Typing as a Product

With only a three-person team and a lot on our plate, we approached fully typing our monolith with Sorbet using Shopify’s Get Shit Done (GSD) framework.

  1. We tested the viability of Sorbet on our core monolith by only typing a few files, to check if we could observe benefits from it while not impairing other developers’ work. Sorbet’s gradual approach proved to be working.
  2. We manually created RBI files to represent what Sorbet could not understand yet. We checked we supported Ruby’s most advanced features as well as Rails constructs.
  3. We added more and more files while keeping an eye on performance ensuring Sorbet would scale with our monolith.
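To make the second step more concrete, here is a small sketch of the kind of code that needs a hand-written RBI file. The Order class and its states are hypothetical. The predicate methods are generated at runtime with define_method, so a checker reading the source statically finds no def for them. The RBI text is held in a string here only so the example stays runnable on its own; in a real project it would live in a separate .rbi file.

```ruby
# Methods generated at runtime: the source contains no `def` for
# `pending?`, `paid?`, or `shipped?`.
class Order
  STATES = %w[pending paid shipped].freeze

  def initialize(state)
    @state = state
  end

  STATES.each do |state|
    define_method("#{state}?") { @state == state }
  end
end

# What a hand-written RBI for this class might contain: plain Ruby syntax
# with signatures and empty method bodies (illustrative content only).
ORDER_RBI = <<~'RBI'
  class Order
    sig { returns(T::Boolean) }
    def pending?; end

    sig { returns(T::Boolean) }
    def paid?; end

    sig { returns(T::Boolean) }
    def shipped?; end
  end
RBI

order = Order.new("paid")
puts order.paid?
puts order.shipped?
```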

This gave us confidence Sorbet was the right choice to solve our problem. Once we officially decided to go with Sorbet, we reflected on how we could reach 100% adoption in the monolith. To determine our roadmap, we looked at:

  • how many files needed to be typed
  • the content of the files
  • the Ruby features they used.

Track Static Typing Adoption

Type checking in Sorbet comes in different levels of strictness. The strictness is defined on a per file basis by adding a magic comment in the file, called a sigil, written # typed: LEVEL, where LEVEL can be one of the following: 

  • ignore: At this level, the file is not even read by Sorbet, and no errors are reported for this file at all.
  • false: Only errors related to syntax, constant resolution, and correctness of sigs are reported. At this level, Sorbet doesn’t check the calls in the file, even if the methods called don't exist anywhere in the codebase.
  • true: This is the level where Sorbet actually starts to type check your code. All methods called need to exist in the codebase. For each call, Sorbet will check that the argument count matches the method definition. If the method has a signature, Sorbet will also check the argument types.
  • strict: At this level all methods must have a signature, and all constants and instance variables must have explicitly annotated types.
  • strong: Sorbet no longer allows untyped variables. In practice, this level is actually unusable for most files because Sorbet can’t type everything yet and even Stripe advises against using it.
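As a small illustration of the difference between false and true, consider the hypothetical file below. The Greeter class is invented for the example. Going by the level descriptions above, the two faulty calls would pass unnoticed at typed: false, while at typed: true Sorbet would be expected to flag both before the code ever runs. Without a type checker, plain Ruby only discovers them when the lines execute.

```ruby
# typed: true

class Greeter
  def greet(name)
    "Hello, #{name}!"
  end
end

greeter = Greeter.new
puts greeter.greet("Shopify")

# A call to a method that does not exist anywhere (typo in the name).
begin
  greeter.great("Shopify")
rescue NoMethodError => e
  puts "NoMethodError: #{e.message}"
end

# A call whose argument count does not match the method definition.
begin
  greeter.greet
rescue ArgumentError => e
  puts "ArgumentError: #{e.message}"
end
```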

Once we were able to run Sorbet on our codebase, we needed a way to track our progress and identify which parts of our monolith were typed with which strictness or which parts needed more work. To do so we created SorbetMetrics, a tool able to collect and display metrics about typing coverage for all our internal projects. We started tracking three key metrics to measure Sorbet adoption:

  • Sigils: how many files are typed ignore, false, true, strict or strong
  • Calls: how many calls are sent to a method with a signature
  • Signatures: how many methods have a signature
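The sigils metric is the simplest of the three to approximate. The sketch below is not how SorbetMetrics works internally, which this post does not show. It is a minimal illustration, assuming a simple regular expression for the sigil comment and a hypothetical in-memory map of file paths to source text in place of reading real files.

```ruby
# Assumed pattern for a sigil comment such as "# typed: true".
SIGIL_PATTERN = /^#\s*typed:\s*(ignore|false|true|strict|strong)\s*$/

# Returns the strictness level declared in a file's source, or "none"
# when no sigil comment is present.
def sigil_for(source)
  match = source.match(SIGIL_PATTERN)
  match ? match[1] : "none"
end

# Tallies sigils over a hash of { path => source }.
def sigil_counts(sources)
  sources.values.map { |source| sigil_for(source) }.tally
end

# Hypothetical files standing in for a real project tree.
sources = {
  "app/models/shop.rb" => "# typed: strict\nclass Shop; end\n",
  "app/helpers/money.rb" => "# typed: true\nmodule Money; end\n",
  "test/shop_test.rb" => "# typed: false\nclass ShopTest; end\n",
  "lib/legacy.rb" => "class Legacy; end\n",
}

counts = sigil_counts(sources)
counts.each { |level, count| puts "#{level}: #{count}" }
```

Counting calls and signatures is harder, because it requires resolving which method each call targets, which is information only the type checker itself has.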

Bar graph showing increased Sorbet usage in projects over time. Below the bar graph is a table showing the percentage of sigils, calls, signatures in each project.
SorbetMetrics dashboard homepage

Each day SorbetMetrics pulls the latest version of our monolith and other Shopify projects using Sorbet, computes those metrics and displays them in a dashboard internally available to all our developers.

A selection of charts from the SorbetMetrics Dashboard. 3 pie charts showing the percentage of sigils, calls, and signatures in the monolith. 3 line charts showing Sigils, calls, and signature percentage over time. A circular tree map showing the relationship between strictness level and components. 2 line charts showing Sorbet versions and typechecking time over time
SorbetMetrics dashboard for our monolith

Sorbet Support at Scale

If we treat typing as a product, we also need to focus on supporting and enabling our “customers” who are developers at Shopify. One of our goals was to have a strong support system in place to help with any problems that arise and slow developers down.

Initially, we supported developers with a dedicated Slack channel where Shopifolk could ask questions to the team. We’d answer these questions real-time and help Shopifolk with typing efforts where our input was important.

This white glove support model obviously didn't scale, but it was an excellent learning opportunity for our team—we now understood the biggest challenges and recurring problems. We ended up solving some problems over and over again, but it solidified the effort to understand the patterns and decide which features to work on next.

Using Slack meant our answers weren't discoverable forever. We moved most of the support and conversation to our internal Discourse platform, increasing discoverability and broader sharing of knowledge. This also allows us to record solutions in a single place and let developers self-serve as much as possible. As we onboard more and more projects with Sorbet, this solution scales better.

Understand Developer Happiness

Beyond unblocking our users, we also need to ensure their happiness. Sorbet and more generally static typing in Ruby wouldn’t be a good fit for us if it made our developers miserable. We’re aware that it introduces a bit more work, so the benefits need to balance with the inconvenience.

Our first tool to measure developers’ opinions of Sorbet is surveys. Twice a year, we send a “Typing @ Shopify” survey to all developers and collect their sentiments regarding Sorbet’s benefits and limitations, as well as what we should focus on in the future.

A bar graph showing the increasing strongly agree answer over time to the question I want Sorbet to be applied to other Shopify projects. Below that graph is a bar graph showing the increasing strongly agree answer over time to the question I want more code to be typed.
Some responses from our “Sorbet @ Shopify” surveys

We use simple questions (“yes” or “no”, or a “Strongly Disagree” (1) to “Strongly Agree” (5) scale) and then look at how the answers evolve over time. The survey results gave us interesting insights:

  • Sorbet catches more errors on developers’ pull requests (PRs) as adoption increases
  • Signatures help with code understanding and give developers confidence to ship
  • The added confidence directly contributed to the increasingly positive opinion of static typing in Ruby
  • Over time developers wanted more code and more projects to be typed
  • Developers get used to Sorbet syntax over time
  • IDE integration with Sorbet is a feature developers are rooting for

Our main observation is that developers enjoy Sorbet more as the typing coverage increases. This is one reason we’re motivated to reach 100% of files at typed: true and maximize the number of methods with a signature.

The second tool is interviews with individual developers. We select a team working with Sorbet and meet each member to talk about their experience using Sorbet either in the monolith or on another project. We get a better understanding of what their likes and dislikes are, what we should be improving, but also how we can better support them when introducing Sorbet, so the team keeps Sorbet in their project.

The Current State of Sorbet at Shopify

Currently, Sorbet is only enforced on our main monolith and we have about 60 other internal projects that opted to use Sorbet as well. On our main monolith, we require all files to be at least typed: false and Sorbet is run on our continuous integration platform (CI) for every PR and fails builds if type checking errors are found. We’re currently evaluating the idea of enforcing valid type checking on CI even before running the automated tests.

Three pie charts showing percentage of sigils, calls, and signatures in the monolith used to measure Sorbet adoption
Typing coverage metrics for Shopify’s monolith

As of today, 80% of our files (including tests) are typed: true or higher. Almost half of our calls are typed and half of our methods have signatures. All of this can be type checked in under 15 seconds on our developers’ machines.

A circular tree map showing the relationship between strictness level and components
Files strictness map in Shopify’s monolith

The circle map shows which parts of our monolith are at which strictness level. Each dot represents a Ruby file (excluding tests). Each circle represents a component (a set of Ruby files serving the same application concern). Yes, it looks like a Petri dish and our goal is to eradicate the bad orange untyped cells.

A bar graph showing increased number of Shopify projects using Sorbet over time
Shopify projects using Sorbet

Outside of the core monolith, we’ve also observed a natural increase of Shopify projects, both internal and open-source, using Sorbet. As I write these lines, more than 60 projects now use Sorbet. Shopifolk like Sorbet and use it on their own without being forced to do so.

A bar graph showing manual dev tc runs from developers’ machines on our monolith in 2019

Finally, we track how many times our developers ran the command dev tc to typecheck a project with Sorbet on their development machine. This shows us that developers don’t wait for CI to use Sorbet—everyone enjoys a faster feedback loop.

The Benefits of Types for Ruby Developers 

Now that Sorbet is fully adopted in our core monolith, as well as in many internal projects, we’re starting to see the benefits of it on our codebases as well as on our developers. Our team that is working on Sorbet adoption is part of the broader Ruby Infrastructure team which is responsible for delivering a fast, scalable and safe Ruby language for Shopify. As part of that mandate, we believe that static typing has a lot to offer for Ruby developers, especially when working on big, complex codebases.

In this post I focused on the process we followed to ensure Sorbet was the right solution for our needs, treating static typing as a product, and showed the benefits of this product for our customers: Shopifolk working on our monolith and outside. Are you curious to know how we got there? Then you’ll be interested in the second part: Adopting Sorbet at Scale, where I present the tools we built to make adoption easier and faster, the projects we open-sourced to share with the community, and the preliminary results of our experiment with static typing to reduce production errors.

Happy typing!

—The Ruby Infrastructure Team

