An important part of any technical design is choosing where to store your data. Does it conform to a schema or is it flexible in structure? Does it need to stick around forever or is it temporary?
In this article, we’ll describe five common data stores and their attributes. We hope this information will give you a good overview of different data storage options so that you can make the best possible choices for your technical design.
The five types of data stores we will discuss are:
- Relational database
- Non-relational (“NoSQL”) database
- Key-value store
- Full-text search engine
- Message queue
Relational Database
Databases are, like, the original data store. When we stopped treating computers like glorified calculators and started using them to meet business needs, we started needing to store data. And so we (and by we, I mean Charles Bachman) invented the first database management system in 1963. By the mid-to-late ’70s, these database management systems had become the relational database management systems (RDBMSs) that we know and love today.
A relational database, or RDB, is a database that uses the relational model of data.
Data is organized into tables. Each table has a schema which defines the columns for that table. The rows of the table, which each represent an actual record of information, must conform to the schema by having a value (or a NULL value) for each column.
Each row in the table has its own unique key, also called a primary key. Typically this is an integer column called “ID.” A row in another table might reference this table’s ID, thus creating a relationship between the two tables. When a column in one table references the primary key of another table, we call this a foreign key.
Using this concept of primary keys and foreign keys, we can represent incredibly complex data relationships using incredibly simple foundations.
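To make primary and foreign keys concrete, here is a minimal sketch using Python's built-in sqlite3 module rather than MySQL. The shops and products tables and their columns are made up for illustration.

```python
import sqlite3

# In-memory database so the sketch leaves nothing behind.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces foreign keys when asked

# Hypothetical tables: every product belongs to exactly one shop.
conn.execute("CREATE TABLE shops (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    """
    CREATE TABLE products (
        id      INTEGER PRIMARY KEY,
        shop_id INTEGER NOT NULL REFERENCES shops(id),  -- foreign key
        title   TEXT NOT NULL
    )
    """
)

conn.execute("INSERT INTO shops (id, name) VALUES (1, 'Cat Emporium')")
conn.execute("INSERT INTO products (shop_id, title) VALUES (1, 'Scratching post')")
conn.execute("INSERT INTO products (shop_id, title) VALUES (1, 'Catnip mouse')")

# Follow the foreign key back to the shop with a join.
rows = conn.execute(
    """
    SELECT shops.name, products.title
    FROM products
    JOIN shops ON shops.id = products.shop_id
    ORDER BY products.id
    """
).fetchall()

# A product pointing at a shop that does not exist is rejected.
try:
    conn.execute("INSERT INTO products (shop_id, title) VALUES (99, 'Orphan')")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True
```

The join is where the relationship pays off: two simple tables answer the question "which shop sells this product?" without duplicating the shop's name on every product row.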
SQL, which stands for structured query language, is the industry standard language for interacting with relational databases.
At Shopify, we use MySQL as our RDBMS. MySQL is durable, resilient, and persistent. We trust MySQL to store our data and never, ever lose it.
Other features of RDBMSs are:
- Replicated and distributed (good for scalability)
- Enforces schemas and atomic, consistent, isolated, and durable (ACID) transactions (leads to well-defined, expected behavior of your queries and updates)
- Good, configurable performance (fast lookups, can tune with indices, but can be slow for cross-table queries)
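As a rough illustration of what atomic transactions buy you, the sketch below again uses Python's built-in sqlite3 module as a stand-in for a production RDBMS. The accounts table, the CHECK constraint, and the transfer helper are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: balances may never go negative.
conn.execute(
    "CREATE TABLE accounts (id INTEGER PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.execute("INSERT INTO accounts (id, balance) VALUES (1, 100), (2, 0)")
conn.commit()

def transfer(conn, src, dst, amount):
    """Move `amount` between accounts; either both updates apply or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE id = ?", (amount, dst)
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE id = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        return False

ok = transfer(conn, 1, 2, 60)         # succeeds
overdrawn = transfer(conn, 1, 2, 60)  # would leave account 1 at -20, so it rolls back
balances = conn.execute("SELECT id, balance FROM accounts ORDER BY id").fetchall()
```

In the failed transfer the credit to account 2 has already run when the debit violates the schema, yet the rollback undoes it, so the data never ends up in a half-updated state.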
When to Use a Relational Database
Use a database for storing your business-critical information. Databases are the most durable and reliable type of data store. Anything that you need to store permanently should go in a database.
Relational databases are typically the most mature databases: they have withstood the test of time and continue to be an industry standard tool for the reliable storage of important data.
It’s possible that your data doesn’t conform nicely to a relational schema or your schema is changing so frequently that the rigid structure of a relational database is slowing down your development. In this case, you can consider using a non-relational database instead.
Non-Relational (NoSQL) Database
Computer scientists over the years did such a good job of designing databases to be available and reliable that we started wanting to use them for non-relational data as well: data that doesn’t strictly conform to some schema, or that has a schema so variable that it would be a huge pain to try to represent it in relational form.
These non-relational databases are often called “NoSQL” databases. They have roughly the same characteristics as SQL databases (durable, resilient, persistent, replicated, distributed, and performant) except for the major difference of not enforcing schemas (or enforcing only very loose schemas).
NoSQL databases can be categorized into a few types, but there are two primary types which come to mind when we think of NoSQL databases: document stores and wide column stores.
(In fact, some of the other data stores below are technically NoSQL data stores, too. We have chosen to list them separately because they are designed and optimized for different use cases than these more “traditional” NoSQL data stores.)
Document Store
A document store is basically a fancy key-value store where the key is often omitted and never used (although one does get assigned under the hood—we just don’t typically care about it). The values are blobs of semi-structured data, such as JSON or XML, and we treat the data store like it’s just a big array of these blobs. The query language of the document store will then allow you to filter or sort based on the content inside of those document blobs.
A popular document store you might have heard of is MongoDB.
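The "big array of blobs" idea can be sketched in a few lines of plain Python. This is not how MongoDB or any real document store is implemented or queried; the find helper and the sample documents are hypothetical and only show filtering and sorting on content inside the blobs.

```python
import json

# Each document is a semi-structured JSON blob; note that the shapes differ.
raw_documents = [
    '{"title": "Scratching post", "price": 35, "tags": ["cat", "furniture"]}',
    '{"title": "Dog leash", "price": 12}',
    '{"title": "Catnip mouse", "price": 4, "tags": ["cat", "toy"], "refillable": true}',
]
collection = [json.loads(doc) for doc in raw_documents]

def find(collection, predicate, sort_key=None):
    """Return the documents matching `predicate`, optionally sorted by a field."""
    matches = [doc for doc in collection if predicate(doc)]
    if sort_key is not None:
        matches.sort(key=lambda doc: doc[sort_key])
    return matches

# Filter on content inside the blobs; a missing field is treated as empty.
cat_products = find(collection, lambda doc: "cat" in doc.get("tags", []), sort_key="price")
titles = [doc["title"] for doc in cat_products]
```

Nothing forced the three documents to share a shape, which is exactly the flexibility (and the risk) of not enforcing a schema.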
Wide Column Store
A wide column store is somewhere in between a document store and a relational DB. It still uses tables, rows, and columns like a relational DB, but the names and formats of the columns can be different for various rows in the same table. This strategy combines the strict table structure of a relational database with the flexible content of a document store.
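One way to picture this is a table keyed by row, where each row holds its own set of named columns. The sketch below is a plain-Python mental model, not Bigtable's actual API; the put helper, the row keys, and the column names are invented for illustration.

```python
from collections import defaultdict

# table[row_key][column_name] = value; each row carries only the columns it needs.
events = defaultdict(dict)

def put(table, row_key, **columns):
    """Write one or more columns into the row identified by `row_key`."""
    table[row_key].update(columns)

put(events, "checkout#1001", shop_id=1, total_cents=3500, currency="CAD")
put(events, "pageview#1002", shop_id=1, url="/products/catnip-mouse")
put(events, "checkout#1001", discount_code="MEOW10")  # add a column to one row only

checkout_columns = sorted(events["checkout#1001"])
pageview_columns = sorted(events["pageview#1002"])
```

Both rows live in the same table and share the shop_id column, but the rest of their columns differ, and adding discount_code to one row required no schema change.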
At Shopify, we use Bigtable as a sink for some streaming events. Other NoSQL data stores are not widely used. We find that the majority of our data can be modeled in a relational way, so we stick to SQL databases as a rule.
When to Use a NoSQL Database
Non-relational databases are most suited to handling large volumes of data and/or unstructured data. They’re extremely popular in the world of big data because writes are fast. NoSQL databases don’t enforce complicated cross-table schemas, so writes are unlikely to be a bottleneck in a system using NoSQL.
Non-relational databases offer a lot of flexibility to developers, so they are also popular with early-stage startups or greenfield projects where the exact requirements are not yet clear.
Another way to store non-relational data is in a key-value store.
Key-Value Store
A key-value store is basically a production-scale hashmap: a map from keys to values. There are no fancy schemas or relationships between data. No tables or other logical groups of data of the same type. Just keys and values, that’s it.
Both Redis and Memcached are in-memory key-value stores, so their performance is top-notch.
Since they are in-memory, they (necessarily) support configurable eviction policies. We will eventually run out of memory for storing keys and values, so we’ll need to delete some. The most popular strategies are Least Recently Used (LRU) and Least Frequently Used (LFU). These eviction policies make key-value stores an easy and natural way to implement a cache.
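To show how an LRU eviction policy turns a key-value store into a cache, here is a small sketch built on Python's OrderedDict. It is a toy model of the policy only; the LRUCache class is hypothetical, and Redis and Memcached implement eviction differently internally.

```python
from collections import OrderedDict

class LRUCache:
    """A fixed-capacity key-value store that evicts the least recently used key."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._items = OrderedDict()  # oldest access first, newest last

    def get(self, key, default=None):
        if key not in self._items:
            return default
        self._items.move_to_end(key)  # reading counts as a use
        return self._items[key]

    def set(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # drop the least recently used entry

cache = LRUCache(capacity=2)
cache.set("a", 1)
cache.set("b", 2)
cache.get("a")     # "a" is now more recently used than "b"
cache.set("c", 3)  # over capacity, so "b" is evicted
```

After the last call the cache holds "a" and "c"; a lookup for "b" misses, and the application would fall back to the primary data store.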
(Note: There are also disk-based key-value stores, such as RocksDB, but we have no experience with them at Shopify.)
One major difference between Redis and Memcached is that Redis supports some data structures as values. A value in Redis can be a list, set, sorted set, hash, or even a HyperLogLog, and you can perform operations directly on those structures. With Memcached, everything is just a blob, and if you want to perform any operations on those blobs, you have to do it yourself and then write the result back to the key again.
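The difference is easier to see side by side. The sketch below uses two plain dictionaries as stand-ins; neither function is a real Redis or Memcached client call, and the helper names are made up for the comparison.

```python
import json

# Blob-style store (Memcached-like): values are opaque bytes.
blob_store = {}

def blob_append(store, key, item):
    """Client-side read-modify-write: fetch the blob, decode, change, write back."""
    items = json.loads(store.get(key, b"[]"))
    items.append(item)
    store[key] = json.dumps(items).encode()

# Structure-aware store (Redis-like): the store itself understands lists.
struct_store = {}

def list_push(store, key, item):
    """The store appends in place; the client never handles the whole value."""
    store.setdefault(key, []).append(item)
    return len(store[key])

blob_append(blob_store, "recent", "cat")
blob_append(blob_store, "recent", "dog")
list_push(struct_store, "recent", "cat")
length = list_push(struct_store, "recent", "dog")
```

With the blob approach, every update ships the entire value over the network twice, and two clients updating the same key at once can overwrite each other unless you add your own coordination.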
Redis can also be configured to persist to disk, which Memcached cannot. Redis is therefore a better choice for storing persistent data, while Memcached remains only suitable for caches.
When to Use a Key-Value Store
Key-value stores are good for simple applications that need to store simple objects temporarily. An obvious example is a cache. A less obvious example is to use Redis lists to queue units of work with simple input parameters.
Full-Text Search Engine
Search engines are a special type of data store designed for a very specific use case: searching text-based documents.
Technically, search engines are NoSQL data stores. You ship semi-structured document blobs into them, but rather than storing them as-is and using XML or JSON parsers to extract information, the search engine slices and dices the document contents into a new format that is optimized for searching based on substrings of long text fields.
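A common example of such a search-optimized format is an inverted index, which maps each term to the documents that contain it. The sketch below is a deliberately tiny version with whole-word matching only; real engines layer analysis steps such as stemming, n-grams, and relevance scoring on top, and the function names here are hypothetical.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(documents):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents containing every token in the query."""
    token_sets = [index.get(token, set()) for token in tokenize(query)]
    if not token_sets:
        return set()
    return set.intersection(*token_sets)

documents = {
    1: "Sturdy scratching post for your cat",
    2: "Leather leash for large dogs",
    3: "Catnip mouse, a toy your cat will love",
}
index = build_index(documents)
cat_docs = search(index, "cat")
cat_toy_docs = search(index, "cat toy")
```

A query becomes a handful of set lookups and intersections instead of a scan over every document, which is why the work done at indexing time pays off at search time.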
Search engines are persistent, but they’re not designed to be particularly durable. You should never use a search engine as your primary data store! It should be a secondary copy of your data, which can always be recreated from the original source in an emergency.
At Shopify we use Elasticsearch for our full-text search. Elasticsearch is replicated and distributed out of the box, which makes it easy to scale.
The most important feature of any search engine, though, is that it performs exceptionally well for text searches.
To learn more about how full-text search engines achieve this fast performance, you can check out Toria’s lightning talk from StarCon 2019.
When to Use a Full-Text Search Engine
If you have found yourself writing SQL queries with a lot of wildcard matches (for example, “SELECT * FROM products WHERE description LIKE '%cat%'” to find cat-related products) and you’re thinking about brushing up on your natural-language processing skills to improve the results… you might need a search engine!
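Here is a small demonstration of why wildcard queries tend to disappoint, using Python's built-in sqlite3 module. The products table and its rows are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, description TEXT)")
conn.executemany(
    "INSERT INTO products (description) VALUES (?)",
    [
        ("A feather toy for your cat",),
        ("Mug that says 'I concatenate strings'",),
        ("Category-leading dog leash",),
        ("Bird seed, 2 kg",),
    ],
)

# Substring match: LIKE has no notion of words, so unrelated rows slip in.
matches = [
    row[0]
    for row in conn.execute(
        "SELECT description FROM products WHERE description LIKE '%cat%' ORDER BY id"
    )
]
```

The query returns the mug and the dog leash along with the cat toy, because "concatenate" and "Category" both contain the letters "cat". A leading wildcard also generally prevents the database from using an ordinary index on the column, so the query gets slower as the table grows.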
Search engines are also pretty good at searching and filtering by exact text matches or numeric values, but databases are good at that, too. The real value add of a full-text search engine is when you need to look for particular words or substrings within longer text fields.
Message Queue
The last type of data store that you might want to use is a message queue. It might surprise you to see message queues on this list because they are considered more of a data transfer tool than a data storage tool, but message queues store your data with as much reliability and even more persistence than some of the other tools we’ve discussed already!
At Shopify, we use Kafka for all our streaming needs. Payloads called “messages” are inserted into Kafka “topics” by “producers.” On the other end, Kafka “consumers” can read messages from a topic in the same order in which they were inserted (Kafka guarantees this ordering within each partition of a topic).
Under the hood, Kafka is implemented as a distributed, append-only log. It’s just files! Although not human-readable files.
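The append-only log model can be sketched in plain Python. This is a single-partition toy, not Kafka's client API; the Topic and Consumer classes and their methods are hypothetical names chosen to mirror the vocabulary above.

```python
class Topic:
    """A single-partition, append-only log of messages."""

    def __init__(self):
        self._log = []

    def append(self, message):
        """Producer side: add a message and return its offset in the log."""
        self._log.append(message)
        return len(self._log) - 1

    def read(self, offset, max_messages=10):
        """Consumer side: return messages starting at `offset`, in insertion order."""
        return self._log[offset:offset + max_messages]

class Consumer:
    """Tracks its own position, so reading never removes anything from the log."""

    def __init__(self, topic):
        self.topic = topic
        self.offset = 0

    def poll(self):
        batch = self.topic.read(self.offset)
        self.offset += len(batch)
        return batch

orders = Topic()
for message in ("order-1", "order-2", "order-3"):
    orders.append(message)

billing = Consumer(orders)
shipping = Consumer(orders)
first_batch = billing.poll()   # billing reads everything so far
orders.append("order-4")
second_batch = billing.poll()  # only the new message
late_batch = shipping.poll()   # an independent consumer still sees the full history
```

Because consumers only advance an offset, several services can read the same topic at their own pace, and a consumer that falls behind or restarts can pick up where it left off.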
Kafka is typically treated as a message queue, and rightly belongs in our message queue section, but it’s technically not a queue. It’s a distributed log, which means that we can set a data retention time of “forever” and compact our messages by key (retaining only the most recent value for each key). Do both, and we’ve basically got a key-value document store!
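Compaction by key is simple to express in code. The function below is an illustrative sketch of the idea, not Kafka's implementation (which compacts log segments in the background); the record format of (key, value) tuples is an assumption.

```python
def compact(log):
    """Keep only the most recent value for each key, preserving log order."""
    latest_offset = {}
    for offset, (key, _value) in enumerate(log):
        latest_offset[key] = offset  # later records overwrite earlier ones
    return [
        record
        for offset, record in enumerate(log)
        if latest_offset[record[0]] == offset
    ]

log = [
    ("product:1", {"price": 35}),
    ("product:2", {"price": 12}),
    ("product:1", {"price": 30}),
    ("product:2", {"price": 10}),
    ("product:3", {"price": 4}),
]
compacted = compact(log)
current_state = dict(compacted)  # the compacted log reads like a key-value store
```

Replaying the compacted log from the beginning rebuilds the latest value of every key, which is what makes the "log as key-value store" trick possible.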
Although there are some legitimate use cases for such a design, if what you need is a key-value document store, a message queue is probably not the best tool for the job. You should use a message queue when you need to ship some data between services in a way that is fast, reliable, and distributed.
When to Use a Message Queue
Use a message queue when you need to temporarily store, queue, or ship data.
If the data is very simple and you’re just storing it for use later in the same service, you could consider using a key-value store like Redis. You might consider using Kafka for the same simple data if it’s very important data, because Kafka is more reliable and persistent than Redis. You might also consider using Kafka for a very large amount of simple data, because Kafka is easier to scale by adding distributed partitions.
Kafka is often used to ship data between services. The producer-consumer model has a big advantage over other solutions: because Kafka itself acts as the message broker, you can simply ship your data into Kafka and then the receiving service can poll for updates. If you tried to use something simpler, like Redis, you would have to implement some kind of notification or polling mechanism yourself, whereas Kafka has this built in.
These are not the be-all and end-all of data stores, but we think they are the most common and useful ones. Knowing about these five types of data stores will get you on the path to making great design decisions!
What do you think? Do you have a favourite type of data store that didn’t make it on the list? Let us know in the comments below.