Mitigating Deadlocks in High-Concurrency Environments

The problem

To guarantee data integrity, MySQL acquires locks before updating existing records. When multiple processes update data in the same table in parallel, deadlocks can occur depending on how the data is stored on disk and managed by MySQL. This can happen even when the processes target different records. This blog post explores the problem of deadlocks in MySQL while updating data, and illustrates how composite primary keys can effectively mitigate the issue.

Background

When upserting a batch of records that mixes new and existing data, MySQL first attempts to insert every record. For the ones that already exist, a collision on a unique constraint indicates that an update rather than an insert is required. To perform those updates safely, MySQL acquires gap locks.
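In SQL, such a batch upsert typically takes the form of an `INSERT ... ON DUPLICATE KEY UPDATE` statement. A minimal sketch, where the table and column names (`transactions`, `id`, `account`, `amount`) are illustrative assumptions rather than the original schema:

```sql
-- Batch upsert: rows whose `id` already exists collide on the
-- primary key and fall through to the UPDATE clause instead.
INSERT INTO transactions (id, account, amount)
VALUES
  (101, 7, 19.90),
  (102, 7, 5.00),
  (103, 7, 42.50)
ON DUPLICATE KEY UPDATE
  amount = VALUES(amount);
```

It is the update path of this statement that forces MySQL to take gap locks on the index ranges involved.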

When this process is repeated concurrently by multiple processes, it is prone to deadlocks.

That happens because a gap lock covers not only the record being updated but also the range immediately before it in the index structure. MySQL does this so that it can safely update the index as well.

If the `pkid` is sequential, the data will be stored on disk in that same order. When gap locks are acquired to coordinate the multiple processes updating the data, the ranges locked by different processes can overlap. Depending on the degree of parallelism and the number of unique-constraint collisions, this can lead to deadlocks, and depending on their frequency, those deadlocks can significantly degrade application performance. Deadlocks can occur even when the processes operate on entirely different batches of records.

The illustration depicts an example table called `transactions`. This table is updated in parallel by ETL processes that each operate on a single `account`. Despite the fact that each process targets a different set of `transactions` records, their lock requirements can overlap and cause deadlocks.
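Such a table might be defined as follows. This is a hypothetical schema, since the original is not shown; the point is that a single sequential primary key interleaves rows from all accounts in insertion order on disk:

```sql
-- Hypothetical original schema: with a single auto-increment
-- primary key, rows from different accounts end up physically
-- adjacent, so a gap lock taken by one ETL process can cover
-- rows that another process needs to touch.
CREATE TABLE transactions (
  id      BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
  account BIGINT UNSIGNED NOT NULL,
  amount  DECIMAL(12, 2)  NOT NULL,
  PRIMARY KEY (id)
);
```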

The solution

We came across a blog post that used composite primary keys to obtain performance gains at read time. That made us realize that changing how the data is stored on disk to match the pattern in which it is inserted and updated could also mitigate deadlocks in our context. MySQL will still require gap locks to manage collisions and preserve data integrity, but with this approach we can compartmentalize the data so that the gap locks required by each process instance stay within its own range, dramatically reducing the chance of overlapping gaps.

That was accomplished with a composite primary key. The primary key determines how the data is stored on disk: the InnoDB engine physically orders rows by the values of the primary key.

The processes that update the data were already partitioned by `account`, so that same column was added to the new composite key alongside the existing `id`. The benefit of this solution is that it requires no logic changes in the application layer, as long as the original `id` can still be referenced as a unique identifier for each record.

Because of that, we added an extra unique index on `id` so that it can still uniquely identify a record on its own. This preserves backward compatibility with any existing joins in the application that use only the `id` column.
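In DDL terms, the migration described above could look like the following sketch. The table and column names are illustrative assumptions, and in practice a change like this on a large table would typically be run through an online schema-change tool:

```sql
-- Replace the single-column primary key with a composite one that
-- leads with `account`, so each process's rows are clustered
-- together on disk, and keep `id` unique for backward compatibility.
ALTER TABLE transactions
  DROP PRIMARY KEY,
  ADD PRIMARY KEY (account, id),
  ADD UNIQUE KEY uniq_id (id);
```

Because InnoDB clusters rows by primary key, leading the key with `account` means each ETL process's gap locks now fall within its own account's contiguous range.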

This illustration shows how the same ETL processes interact with the data under the new composite primary key. The logic of the ETL jobs remained the same, but because the data is now stored on disk on a per-`account` basis, the required locks no longer overlap.

Conclusion

This approach enabled us to solve the problem at the database layer without any logic changes at the application level. It allowed us to remove the artificial delays originally added to our processes to mitigate the deadlocks, and by reducing that wait time, the overall throughput of our import jobs increased significantly.
