We were looking for a reliable way to collect event data and send it to our data warehouse.
We were considering a more service-oriented architecture, and needed a standardized way of message passing between the components.
We were starting to evaluate containerization of Shopify, and were searching for a way to get logs out of containers.
We were intrigued by Kafka due to its highly available design. However, Kafka runs on the JVM, and its primary user, LinkedIn, runs a full JVM stack. Shopify is mainly Ruby on Rails and Go, so we had to figure out how to integrate Kafka into our infrastructure.
Avoiding a direct link between Ruby and Kafka
The straight forward solution is to emit events directly from Shopify, buffer them on a queue and have a thread work off them—this is what LinkedIn does. However, at the time we had zero experience operating Kafka, so we were hesitant to go with an approach that would make it impossible to deploy during Kafka downtime without losing data, because of full queues in process memory. Furthermore, there was no great Ruby library around and we'd already built a the Sarama Go library for Kafka. On top of Sarama we built a producer to sit alongside Shopify to deliver events to Kafka.
The challenge proved to be reliably delivering events from Rails to the Go producer. We did not trust our ability to operate Kafka as a high availability system just yet, we needed a solution that allowed for downtime without data loss. Our first experiments were flat files and syslog—but receiving from the other end was complex and unreliable. Eventually, we settled for the ancient IPC mechanic SysV message queues. We chose SysV MQs over POSIX MQs because at the time we were developing on OS X, which has not implemented POSIX MQs.
The SysV message queue gave us a system-level buffer owned by the kernel, which allowed either end of the queue to restart at anytime, making the deploy logic extremely simple. We wrote small wrappers for Go and Ruby. This turned out to work well for our purposes. If Kafka ever went down, the maximum queue size is sufficient to store two hours worth of events. The size of the queue is closely monitored, but stays at just a few kbs for the most part.
Adding Docker to the mix
A few months later our team took on the effort to containerize Shopify with Docker. Containers isolate the IPC namespace by default, which means the container and the host can not share a SysV queue. That means running a producer inside every container.
At Shopify, we strive to keep our containers as minimal as possible—we think of them as a strong isolation layer for a single process, not as a complete virtual machine. That means we want to avoid the operational nightmare of running a Kafka producer inside thousands of containers that are starting and stopping constantly. The queue must always be empty before a container shuts down, which would make it impossible to deploy in pressured situations or in the event of brief Kafka downtime. It would also mean thousands of additional connections to Kafka, which may or may not be fine, but it’s not a load on Kafka we have experience with.
With that option out of the way, we decided to follow the convention we were already using with StatsD and run the service on the host and communicate with it through TCP. We wrote a small TCP to SysV MQ proxy to run alongside the Kafka producer to give us many of the same guarantees as we have had in the past, but allowing the container to send events over the network. Would we have started over today wanting a similar setup, we probably would’ve written or used something like event-shuttle.