Schematizing Deletion at Scale
At Shopify, we analyze a variety of events from our buyers and merchants to improve their experience and the platform, and to empower their decision making. These events are collected via our streaming platform (Kafka) and stored in our data warehouse, which ingests event data at a rate of tens of billions of events per day. The image below depicts how these events have historically been collected, stored in our data warehouse, and used in other online dashboards.
We set out to reorganize our systems to improve the reliability, performance, and efficiency of data processing for deletion. The Privacy team and the Data Science & Engineering teams collaborated to address these challenges together, achieving long-term benefits. The rest of this blog post focuses on that collaboration and the technical challenges we faced while addressing these issues in an organization as large as Shopify.
Context Collection
Lack of guaranteed schemas for events was the root cause of many of our challenges. To address this, we designed a schematization system that specifies the structure of each event, including the type of each field, evolution (versioning) context, ownership, and privacy context. The privacy context specifically covers marking sensitive data, identifying data subjects, and specifying how PII should be handled.
Schemas are designed by data scientists or developers interested in capturing a new kind of event (or changing an existing one). They’re proposed in a human readable JSON format and then reviewed by team members for accuracy and privacy reasons. As of today, we have more than 4500 active schemas. This schema information is then used to enforce and guarantee the structure of every single event going through our pipeline at generation time.
The image above shows a trimmed signup event schema. Let’s read through this schema and see what we learn from it:
The privacy_setting section specifies whose PII this event includes by defining a data controller and a data subject. The data controller is the entity that decides why and how personal data is processed (Shopify in this example). The data subject designates whose data is being processed; in this schema, the subject is tracked via the email of the person in question.
Every field in a schema has a data type, a doc field, and a privacy block indicating whether it contains sensitive data. The privacy block specifies what kind of PII the field collects and how that PII should be handled.
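For illustration, here's a rough sketch of what a schema carrying this privacy context could look like, written as a Python dict mirroring the human-readable JSON format mentioned above. The field names, handling values, and overall layout are assumptions drawn from the description in this post, not Shopify's actual schema definition.

```python
# Hypothetical sketch of a trimmed signup schema, mirroring the human-readable
# JSON format described above. Field names and handling values are assumptions
# based on the prose, not Shopify's actual schema definition.
signup_schema = {
    "name": "signup_event",
    "version": 1,
    "owner": "identity-team",
    "privacy_setting": {
        # Who decides why and how the personal data is processed.
        "data_controller": "shopify",
        # Which field identifies whose data is being processed.
        "data_subject": "email",
    },
    "fields": [
        {
            "name": "email",
            "type": "string",
            "doc": "Email address the person signed up with",
            "privacy": {"sensitive": True, "pii_type": "email", "handling": "tokenize"},
        },
        {
            "name": "ip_address",
            "type": "string",
            "doc": "IP address of the signup request",
            "privacy": {"sensitive": True, "pii_type": "ip_address", "handling": "obfuscate"},
        },
        {
            "name": "signup_page",
            "type": "string",
            "doc": "Page the signup originated from",
            "privacy": {"sensitive": False},
        },
    ],
}
```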
Our new schematization platform successfully captured this context, and the discussions about identifying personal data fields during schema review significantly increased privacy awareness and education among our data scientists and developers. The platform also improved reusability and observability and streamlined common data science tasks. It demonstrated the value of capitalizing on shared goals across different teams in a large organization.
Personal Data Handling
At this point, we have schemas that gather all the context we need regarding structure, ownership, and privacy for our analytical events. The next question is how to handle and track personal information accurately in our data warehouse.
We perform two types of transformation on personal data before it enters our data warehouse. These transformations convert personal (identifying) data into non-personal (non-identifying) data. In particular, we employ two pseudonymisation techniques: obfuscation and tokenization.
Obfuscation and Enrichment
In obfuscation, identifying parts of the data are either masked or removed so the people whom the data describes remain anonymous. This often removes the need for storing personal data at all. The crucial point, however, is to preserve the analytical value of these records so they remain useful.
When we obfuscate an IP address, for example, we mask half of its bytes but include geolocation data at the city and country level. In most cases, this is what the raw IP address was intended for in the first place. This had a big impact on adoption of our new platform and, in some cases, offered added value too.
Looking at different types of PII and how they’re used, we quickly observed patterns. For instance, the main use of a full user agent string is to determine the operating system, device type, and major version, which are shared among many users. But a user agent can also contain very detailed identifying information, including screen resolution, installed fonts, clock skew, and other bits that can identify a data subject. So, during obfuscation, all identifying bits are removed and replaced with the generalized, aggregate-level data that analysts actually need. The table below shows some examples of different PII types and how they’re obfuscated.
| PII Type | Raw Form | Obfuscated |
|---|---|---|
| IP Address | 207.164.33.12 | { "masked": "207.164.0.0", "geo_country": "Canada" } |
| User agent | CPU iPhone OS 9_3_2 like Mac OS X) AppleWebKit/601.1.46 (KHTML, like Gecko) Mobile/13F69 Instagram 8.4.0 (iPhone7,2; iPhone OS 9_3_2; nb_NO; nb-NO; scale=2.00; 750x1334 | { "Family": "Instagram", "Major": "8" } |
| Latitude/Longitude | 45.4215° N, 75.6972° W | 45.4° N, 75.6° W |
| Email | john@gmail.com, behrooz@example.com | REDACTED@gmail.com, REDACTED@REDACTED.com |
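As a minimal sketch of what obfuscation plus enrichment along these lines could look like (illustrative only; the geo lookup is a hypothetical placeholder, and the exact masking and redaction rules are assumptions inferred from the table above):

```python
import ipaddress


def obfuscate_ip(ip: str) -> dict:
    """Mask the lower half of an IPv4 address and enrich it with coarse geo data."""
    octets = ipaddress.IPv4Address(ip).exploded.split(".")
    masked = ".".join(octets[:2] + ["0", "0"])  # 207.164.33.12 -> 207.164.0.0
    # Hypothetical enrichment step: a real pipeline would call a geo-IP service here.
    return {"masked": masked, "geo_country": lookup_country(ip)}


def obfuscate_latlong(lat: float, lon: float) -> tuple:
    """Truncate coordinates to one decimal place (roughly city-level precision)."""
    return (int(lat * 10) / 10, int(lon * 10) / 10)


def obfuscate_email(email: str) -> str:
    """Redact the local part; keep only widely shared provider domains.

    The rule for when a domain is preserved is an assumption inferred from the
    examples in the table above, not Shopify's documented behaviour.
    """
    _local, domain = email.split("@", 1)
    common_providers = {"gmail.com", "hotmail.com", "outlook.com", "yahoo.com"}
    return "REDACTED@" + (domain if domain in common_providers else "REDACTED.com")


def lookup_country(ip: str) -> str:
    """Placeholder geo-IP lookup (assumed helper, not part of the original post)."""
    return "Canada"
```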
Tokenization
Obfuscation is irreversible (the original PII is gone forever) and doesn’t suit every use case. There are times when data scientists require access to the actual raw data. To address these needs, we built a tokenization engine that exchanges PII with a consistent random token. We then store tokens in the data warehouse. A separate secured vault service is in charge of storing the token to PII mapping. This way, to effect deletion only the mapping in the vault service needs removing and all the copies of that corresponding token across the data warehouse become effectively non-detokenizable (in other words, just a random string).
To understand the tokenization process better let’s go through an example. Let’s say Hooman is a big fan of AllBirds and GymShark products, and he purchases two pairs of shoes from AllBirds and a pair of shorts from GymShark to hit the trails! His purchase data might look like the table below before tokenization:
| Email | Shop | Product | ... |
|---|---|---|---|
| hooman@gmail.com | allbirds | Sneaker | |
| hooman@gmail.com | Gymshark | Shorts | |
| hooman@gmail.com | allbirds | Running Shoes | |
And after tokenization, it looks like this:

| Email | Shop | Product | ... |
|---|---|---|---|
| Token123 | allbirds | Sneaker | |
| Token456 | Gymshark | Shorts | |
| Token123 | allbirds | Running Shoes | |
There are two important observations about the after-tokenization table:
- The same PII (hooman@gmail.com) was replaced by the same token (Token123) under the same data controller (the allbirds shop) and data subject (Hooman). This is the consistency property of tokens.
- On the other hand, the same PII (hooman@gmail.com) got a different token (Token456) under a different data controller (the Gymshark shop) even though the actual PII remained the same. This is the multi-controller property of tokens, and it allows data subjects to exercise their rights independently with different data controllers (merchant shops). For instance, if Hooman wants to be forgotten or deleted from allbirds, that shouldn’t affect his history with Gymshark.
Now let’s take a look at how all of this information is stored within our tokenization vault service, shown in the table below:

| Data Subject | Controller | Token | PII |
|---|---|---|---|
| hooman@gmail.com | allbirds | Token123 | hooman@gmail.com |
| hooman@gmail.com | Gymshark | Token456 | hooman@gmail.com |
| ... | ... | ... | ... |
The vault service holds the token-to-PII mapping along with its data subject and controller context, and it uses that context to decide whether to generate a new token for a given PII value or reuse an existing one. The consistency property of tokens allows data scientists to perform analysis without requiring access to the raw data. For example, all of Hooman’s orders from Gymshark can be tracked just by looking for Token456 across the tokenized orders dataset.
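To make these semantics concrete, here's a toy in-memory sketch of a vault exhibiting the consistency, multi-controller, and deletion-by-unmapping behaviour described above. The class name, method names, and token format are assumptions for illustration; Shopify's actual vault is a separate secured service whose implementation isn't described in the post.

```python
import secrets


class TokenVault:
    """Toy in-memory sketch of the tokenization vault described above."""

    def __init__(self):
        # (data_subject, controller, pii) -> token, and token -> pii.
        self._tokens = {}
        self._pii = {}

    def tokenize(self, data_subject: str, controller: str, pii: str) -> str:
        """Return a consistent random token for a (data subject, controller, PII) triple."""
        key = (data_subject, controller, pii)
        if key not in self._tokens:
            token = "Token" + secrets.token_hex(8)  # random, but reused for the same key
            self._tokens[key] = token
            self._pii[token] = pii
        return self._tokens[key]

    def detokenize(self, token: str):
        """Return the original PII for a token, or None once the mapping is deleted."""
        return self._pii.get(token)

    def forget(self, data_subject: str, controller=None) -> None:
        """Delete all mappings for a data subject, optionally scoped to one controller."""
        matching = [k for k in self._tokens
                    if k[0] == data_subject and (controller is None or k[1] == controller)]
        for key in matching:
            self._pii.pop(self._tokens.pop(key), None)


# Usage mirroring the tables above: the same PII gets the same token per controller,
# a different token for a different controller, and deletion only touches the vault.
vault = TokenVault()
t1 = vault.tokenize("hooman@gmail.com", "allbirds", "hooman@gmail.com")
t2 = vault.tokenize("hooman@gmail.com", "allbirds", "hooman@gmail.com")
t3 = vault.tokenize("hooman@gmail.com", "Gymshark", "hooman@gmail.com")
assert t1 == t2 and t1 != t3                        # consistency and multi-controller properties
vault.forget("hooman@gmail.com", "Gymshark")
assert vault.detokenize(t3) is None                 # t3 is now just a random string
assert vault.detokenize(t1) == "hooman@gmail.com"   # allbirds history is unaffected
```

Because only the vault holds the mapping, dropping those rows is all a deletion request requires; the copies of the token already spread across the data warehouse stay behind as meaningless strings.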
Now, back to our original goal: let’s review how all of this helps with deletion of PII from our data warehouse. If the data in the warehouse is obfuscated and tokenized, then once the mapping is removed from the tokenization vault there’s essentially nothing left in the data warehouse to delete. To see this, let’s walk through some deletion requests and how they affect the tokenization vault. Suppose the vault contains the following mappings:

| Data Subject | Controller | Token | PII |
|---|---|---|---|
| hooman@gmail.com | allbirds | Token123 | hooman@gmail.com |
| hooman@gmail.com | Gymshark | Token456 | hooman@gmail.com |
| hooman@gmail.com | Gymshark | Token789 | 222-333-4444 |
| eva@hotmail.com | Gymshark | Token011 | IP 76.44.55.33 |
First, say Hooman asks Gymshark to delete his data. In the vault, this translates to the deletion request:

DataSubject == ‘hooman@gmail.com’ AND Controller == Gymshark

which results in deletion of the rows marked with a star (*) in the table below:
| Data Subject | Controller | Token | PII |
|---|---|---|---|
| hooman@gmail.com | allbirds | Token123 | hooman@gmail.com |
| * hooman@gmail.com | Gymshark | Token456 | hooman@gmail.com |
| * hooman@gmail.com | Gymshark | Token789 | 222-333-4444 |
| eva@hotmail.com | Gymshark | Token011 | IP 76.44.55.33 |
If instead Hooman asks to be forgotten entirely, across all data controllers, the request matches every row for his data subject identifier, again marked with a star (*):

| Data Subject | Controller | Token | PII |
|---|---|---|---|
| * hooman@gmail.com | allbirds | Token123 | hooman@gmail.com |
| * hooman@gmail.com | Gymshark | Token456 | hooman@gmail.com |
| * hooman@gmail.com | Gymshark | Token789 | 222-333-4444 |
| eva@hotmail.com | Gymshark | Token011 | IP 76.44.55.33 |
Notice that in all of these examples there was nothing to do in the actual data warehouse: once the token ↔ PII mapping is deleted, the tokens are effectively just random strings. In addition, these operations complete in fractions of a second, whereas any task in a petabyte-scale data warehouse can be challenging, time consuming, and resource intensive.
Schematization Platform Overview
So far we’ve covered schematization, obfuscation, and tokenization in detail. Now it’s time to put all of these pieces together in our analytical platform. The image below shows an overview of the journey of an event from when it’s fired until it’s stored in the data warehouse (a simplified code sketch follows the steps):
In this example:
- A SignUp event is triggered into the messaging pipeline (Kafka).
- A tool, the Scrubber, intercepts the message in the pipeline and applies pseudonymisation to its content, using the predefined schema fetched from the Schema Repository for that message.
- The Scrubber identifies that the SignUp event contains tokenization operations too. It sends the raw PII and privacy context to the Tokenization Vault.
- The Tokenization Vault exchanges the PII and privacy context for a token and sends it back to the Scrubber.
- The Scrubber replaces the PII in the content of the SignUp event with the token.
- The new anonymized and tokenized SignUp event is put back onto the message pipeline.
- The new anonymized and tokenized SignUp event is stored in the data warehouse.
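Putting the pieces together, here is a simplified sketch of what the Scrubber step could look like for such an event. The schema layout, helper names, and vault interface are the hypothetical ones from the earlier sketches, not Shopify's actual components:

```python
# Simplified, illustrative sketch of the Scrubber step. The obfuscate() dispatcher
# and StubVault are stand-ins assumed for illustration.

def obfuscate(pii_type: str, value):
    """Stand-in dispatcher to per-PII-type obfuscators like the ones sketched earlier."""
    return f"OBFUSCATED_{pii_type.upper()}"


class StubVault:
    """Stand-in for the tokenization vault service."""

    def tokenize(self, data_subject: str, controller: str, pii: str) -> str:
        return "Token123"  # a real vault returns a consistent random token


def scrub_event(event: dict, schema: dict, vault) -> dict:
    """Apply a schema's privacy context to one event, field by field."""
    privacy = schema["privacy_setting"]
    data_subject = event[privacy["data_subject"]]

    scrubbed = dict(event)
    for field in schema["fields"]:
        handling = field.get("privacy", {}).get("handling")
        value = event.get(field["name"])
        if handling == "obfuscate":
            scrubbed[field["name"]] = obfuscate(field["privacy"]["pii_type"], value)
        elif handling == "tokenize":
            # Raw PII and privacy context go to the vault; only a token comes back.
            scrubbed[field["name"]] = vault.tokenize(
                data_subject, privacy["data_controller"], value
            )
    return scrubbed  # the pseudonymised event is put back onto the message pipeline


if __name__ == "__main__":
    schema = {
        "privacy_setting": {"data_controller": "shopify", "data_subject": "email"},
        "fields": [
            {"name": "email", "type": "string",
             "privacy": {"sensitive": True, "pii_type": "email", "handling": "tokenize"}},
            {"name": "ip_address", "type": "string",
             "privacy": {"sensitive": True, "pii_type": "ip_address", "handling": "obfuscate"}},
            {"name": "signup_page", "type": "string", "privacy": {"sensitive": False}},
        ],
    }
    event = {"email": "jane@example.com", "ip_address": "207.164.33.12", "signup_page": "web"}
    print(scrub_event(event, schema, StubVault()))
```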
Lessons from Managing PII at Shopify Scale
Despite having a technical solution for classifying and handling PII in our data warehouse, Shopify’s scale made adoption and reprocessing of our historic data difficult. Here are some lessons that helped us on this journey.
Adoption
Having a solution and getting it adopted are two different problems. Given the scale of Shopify, collaborating with all stakeholders to implement the new tooling required intentional engagement and productive communication, particularly given the significant changes proposed. Let’s review a few factors that helped us significantly.
Make the Wrong Thing the Hard Thing
Make the right thing the default option. A big factor in the success and adoption of our tooling was making it the default, easy path. Nowadays, creating and collecting unstructured analytical events at Shopify is difficult and requires a tedious process with several layers of approval, whereas creating structured, privacy-aware events is a quick, well documented, and automated task.
“Trust Me, It Will Work” Isn’t Enough!
Proving the scalability and accuracy of the proposed tooling was critical to building trust in our approach. To prove correctness, we used the same reconciliation tooling and mechanisms that the Data Science & Engineering team already relies on. We showed scalability by testing the tooling on real datasets and stress testing it under orders of magnitude higher load.
Make Sure the Tooling Brings Added Value
Our new tooling is not only the default and easy way to collect events, but also offers added value and benefits such as:
- Shared privacy education: Our new schematization platform encourages asking about and discussing privacy concerns, ranging from what counts as PII to what can or can’t be done with it. It brings clarity and education that wasn’t easily available before.
- Increased dataset discoverability: Schemas for events allow us to automatically integrate with query engines and existing tooling, making datasets quick to use and explore.
Benefits like these were a big driver of, and encouragement for, adopting our new tooling.
Capitalizing on Shared Goals
Schematization isn’t only useful for privacy reasons; it also helps with reusability and observability, reduces storage costs, and streamlines common tasks for data scientists. Both the privacy and data teams were important stakeholders in this project, which made collaboration and adoption much easier because we capitalized on shared goals across different teams in a large organization.
Historic Datasets
Our historic datasets are several petabytes of events collected in our data warehouse prior to the schematization platform.
There are intricate interdependencies among the analytical jobs that rely on these datasets. As with adoption, there’s no easy solution to this problem, but here are some practices that helped us mitigate it.
Organizational Alignment
Any task of this scale goes beyond the affected individuals, projects, or even teams, so organizational commitment and alignment are required to get it done. People, teams, priorities, and projects might change, but if there’s organizational support and commitment for addressing privacy issues, the task can survive. Organizational alignment helped us put out consistent messaging to various team leads, which meant everyone understood the importance of the work. With this alignment in place, it was usually just a matter of working with leads to find the right balance of completing their contributions in a timely fashion without completely disrupting their roadmap.
Dedicated Task Force
These kinds of projects are slow and time consuming. We understood the importance of having a dedicated team and project so that progress didn’t depend on individuals. People come and go, but the project must carry on.
Tooling, Documentation, and Support
One of our goals was to minimize the effort individual dataset owners and users needed to migrate their datasets to the new platform. We documented the required steps, built automation for tedious tasks, and created integrations with tooling that data scientists and librarians were already familiar with. In addition, having Engineering support for hurdles was important; on many occasions when performance or other technical issues came up, Engineering support was available to solve the problem. The time spent on building the tooling, documentation, and support procedures easily paid off in the long run.
Regular Progress Monitoring
Regularly questioning dependencies, priorities, and blockers paid off because it surfaced better options. For instance, in a situation where x is considered a blocker for y, maybe:
- we can ask the team working on x to reprioritize and unblock y earlier.
- both x and y can happen at the same time if the teams owning them align on some shared design choices.
- there's a way to reframe x or y or both so that the dependency disappears.
We were able to do this kind of reevaluation because we had regular and constant progress monitoring to identify blockers.
New Platform Operational Statistics
Our new platform has been in production use for over two years. Today, we have over 4,500 distinct analytical schemas for events, each designed to capture certain metrics or analytics and each with its own privacy context. On average, these schemas generate roughly 20 billion events per day, or approximately 230K events per second, with peaks of over 1 million events per second during busy times. Every single one of these events is processed by our obfuscation and tokenization tools in accordance with its privacy context before it becomes accessible in the data warehouse or anywhere else.
Our tokenization vault holds more than 500 billion distinct PII-to-token mappings (approximately 200 terabytes), of which tens to hundreds of millions are deleted daily in response to deletion requests. The magical part of this platform is that deletion happens instantaneously in the tokenization vault without requiring any operation in the data warehouse. This superpower enables us to delete data that used to be very difficult even to identify. These metrics demonstrate the efficiency and scalability of our approach and new tooling.
As part of onboarding our historic datasets onto the new platform, we rebuilt roughly 100 distinct datasets (tens of petabytes of data in total) feeding hundreds of jobs in our analytical platform. Development, rollout, and reprocessing of our historical data altogether took about three years, with help from 94 different individuals, signifying the scale of effort and commitment that went into this project.
We believe sharing the story of a metamorphosis in our data analytics platform is valuable because when we looked for industry examples, there were very few available. In our experience, schematization, together with a platform that captures context including privacy and evolution, is beneficial in analytical event collection systems. It enables a variety of opportunities for treating sensitive information and for educating developers and data scientists on data privacy. In fact, our adoption story showed that people are highly motivated to respect privacy when they have the right tooling at their disposal.
Tokenization and obfuscation proved to be effective tools for handling, tracking, and deleting personal information. They enabled us to efficiently delete data at a very large scale.
Finally, we learned that solving the technical challenges isn’t the entire problem; addressing organizational challenges such as adoption and dealing with historic datasets remains hard. We learned that bringing new value, capitalizing on shared goals, streamlining and automating processes, and having a dedicated task force to champion big cross-team initiatives are effective and helpful techniques.
Additional Information
Behrooz is a staff privacy engineer at Shopify where he works on building scalable privacy tooling and helps teams to respect privacy. He received his MSc in Computer Science at University of Waterloo in 2015. Outside of the binary world, he enjoys being upside down (gymnastics) 🤸🏻, on a bike 🚵🏻♂️ , on skis ⛷, or in the woods. Twitter: @behroozshafiee
Shipit! Presents: Deleting the Undeletable
On September 29, 2021, Shipit!, our monthly event series, presented Deleting the Undeletable. Watch Behrooz Shafiee and Jason White as they discuss the new schematization platform and answer your questions.
Wherever you are, your next journey starts here! If building systems from the ground up to solve real-world problems interests you, our Engineering blog has stories about other challenges we have encountered. Intrigued? Visit our Engineering career page to find out about our open positions and learn about Digital by Default.