Building a Dynamic Mobile CI System

The mobile space has changed quickly, even within the past few years. At Shopify, the world’s largest Rails application, we have seen the growth and potential of the mobile market and set a goal of becoming a mobile-first company. Today, over 130,000 merchants are using Shopify Mobile to set up and run their stores from their smartphones. Through the inherent simplicity and flexibility of the mobile platform, many mobile-focused products have found success.

Our production engineering team recently revamped our old continuous integration setup to be more dynamic and built to scale from the ground up. Previously, the environments were shared between projects, with capacity statically assigned to either Android or iOS. This made evolving the configuration difficult because it required updating all projects at the same time.

We needed a system to replace our preconfigured Mac Minis that was faster, more reliable, and able to scale to a larger number of builds. We set out to build something that completed each build in under 10 minutes and could scale across our entire engineering team. In this blog post we’ll share how we built the new system, how it works, and what we learned in the process.

This post was co-written with Arham Ahmed, with shout-outs to Sean Corcoran of MacStadium and Tim Lucas of Buildkite.

Sander Lijbrink

Shopify's off to San Francisco for SREcon Americas

Shopify production engineering will be at SREcon Americas 2017, held on March 13 and 14. We'll be sharing our best practices and the lessons we’ve learned along the way. 

Tracking Service Infrastructure at Scale - John Arthorne

With the Shopify SRE team focused mainly on the most critical systems, we faced a looming crisis with hundreds of other important applications suffering from lack of clear ownership, inconsistent infrastructure, and poor automation. Here are some of the tools Shopify built for scaling out strong SRE infrastructure and practices across a large fleet of applications.

Monday, March 13, 2017 at 2:50 pm to 3:40 pm, Track 2

Lightning Talks (times to be confirmed)

Data Center Automation at Shopify - David Radcliffe

The flexibility and speed offered by cloud computing solutions have raised the bar for bare metal deployments. Automation is essential to speedy and reliable provisioning and capacity management. We’ll share some of the tools we’ve utilized to automate our data center and empower our developers to move quickly and keep up with the times.

Monday, March 13, 2017 at 2:50 pm to 3:40 pm, Track 3

Four-Minute Deploys: No Engineers Necessary - Lei Lopez

Previously, our devs had to ping an SRE to deploy. This process was prone to error and wasteful. Today, a chatbot automatically deploys Shopify over 40 times a day in four minutes, without losing any requests. The journey wasn’t easy, but undeniably worth it. I’ll share how we made this possible by leveraging Nginx, Docker, and our open-source deployment tool Shipit.

Tuesday, March 14, 2017 at 3:50 pm to 4:45 pm, Track 2

How Three Changes Led to Big Increases in On-call Health - Dale Neufeld

Burnout, unfortunately, is commonplace in operations, and its negative effects are well-documented. However, it doesn’t have to be inevitable. Recently, we realized that action had to be taken to establish a better on-call experience, including moving to a production engineering model. I’ll share specific actions that not only helped to keep our team healthy but also grew people’s expertise.

Tuesday, March 14, 2017 at 3:50 pm to 4:45 pm, Track 2

Booth Presence

You can also catch us at our booth (#3, in the Grand Foyer), where we'll be hosting office hours, chatting about topics related to our talks.

Monday, March 13:
9:55 am to 12:50 pm - Handling Massive Flashes of High-Write Traffic
12:50 pm to 3:45 pm - Road to an SRE Model
3:45 pm to 6:30 pm - Tools for Tracking Service Infrastructure at Scale

Tuesday, March 14:
9:55 am to 12:50 pm - Auto-deploying Anywhere and At Any Time
12:50 pm to 3:45 pm - Road to an SRE Model
3:45 pm to 7:00 pm - Automating Data Center Deployments

Anita Clarke

Surviving Flashes of High-Write Traffic Using Scriptable Load Balancers (Part II)

In the first post of this series, I outlined Shopify’s history with flash sales, our move to Nginx and Lua to help manage traffic, and our initial attempt at throttling traffic, which didn’t sufficiently account for customer experience. We had underestimated the impact of not giving preference to customers who’d entered the queue at the beginning of the sale, and now we needed to find another way to protect the platform without ruining the customer experience.

Emil Stolarsky

Surviving Flashes of High-Write Traffic Using Scriptable Load Balancers (Part I)

This Sunday, over 100 million viewers will watch the Super Bowl. Whether they’re catching the match-up between the Falcons and the Patriots or are there for the commercials between the action, that’s a lot of eyeballs, and that’s only counting America. But all that attention doesn’t just stay on the screen: it gets directed to the web, and if you’re not prepared, curious visitors could be rewarded with a sad error page.

The Super Bowl makes us misty-eyed because our first big flash sale happened in 2007, after the Colts beat the Bears. Fans rushed online for T-shirts celebrating the win, giving us a taste of what can happen when a flood of people converges on one site in a very short time. Since then, we’ve been continually levelling up our ability to handle flash sales, and our merchants have put us to the test: on any given day, they’ll hurl Super Bowl-sized traffic at us, often without notice.

My name is Emil Stolarsky and I work on the Performance and Capacity Planning team at Shopify. This series (with part one today, and part two next week) shares the problems we faced due to overwhelming traffic from flash sales and the thrifty (and nifty!) solution we created that allowed merchants to continue running sales without requiring a major overhaul of our platform.

While not every company faces flash sales, many need to handle high-traffic events that can overload their system, and we hope this post provides inspiration for solutions that can be implemented with a small team and some elbow grease.

Emil Stolarsky

Why Shopify Moved to The Production Engineering Model

The traditional model of running large-scale computer systems divides work into Development and Operations as distinct and separate teams. This split works reasonably well for computer systems that are changed or updated very rarely, and organizations sometimes require this if they’re deploying and operating software built by a different company or organization. However, this rigid divide fails for large-scale web applications that are undergoing frequent or even continuous change. DevOps is the term for a movement that’s gathered steam in the past decade to bring together these disciplines.

Until about a year ago, Shopify followed the traditional model and felt the pain of having ownership separated across teams. Developers were responsible for deploying changes, while three separate teams owned scaling, monitoring, and maintaining the runtime infrastructure, respectively. Having many distinct teams with sometimes divergent goals trying to run the same production system created short-term chaos and made it hard to align on long-term goals.

We thought carefully about how to solve this problem in the right way. Running a large-scale web platform requires very deep operational skills in key areas such as networking, data storage, server management, scaling infrastructure, and transaction processing, so Shopify still required people dedicated to expertise in these areas. On the other hand, the company was building out products and features at blistering speed, so we couldn't accept any kind of organizational or technical barriers that would slow the rate of innovation.


John Arthorne

Shopify heads to Dublin for SREcon Europe

Production engineers from Shopify will be crossing the pond to speak at SREcon Europe from July 11 to 13, 2016 in Dublin, Ireland. From flash sale engineering to fuzz testing to multi-tenant architecture across multiple data centers, we’ve got you covered!

Image credit: Giuseppe Milo

Jaime Woo

Continue reading →