How databases recover data. ARIES algorithm

Hi everyone!

Most of you have probably heard of the acronym ACID, a set of properties of a transactional system (e.g. a database): A — Atomicity, C — Consistency, I — Isolation and D — Durability... But how exactly does it work?

This is a very tough question, and countless people have already tried to answer it in numerous books and research articles. In fact, each of these properties hides so much complexity that you could spend your entire life studying just one of them, say the different ways to preserve durability or provide isolation.

However, some algorithms and data structures turn out to be so useful and impactful that we still use them in production databases and build new research on top of them. One such algorithm is ARIES. Today I'm going to give you a deep dive into the internals of databases and how they use ARIES to store and recover data in case of failures.

What is recovery and why should you care anyway?

First things first: recovery is the process a database goes through after it stops, usually abruptly. There are countless reasons why this happens. It could be a sudden power surge that forces the server to shut down. It could be a programming error in the database itself, in the operating system running it, or even in the hardware underneath. The important part is that the database will stop working sooner or later. And when that happens, it had better still have all the data you tried to save there after restarting.

What is ARIES?

ARIES is an algorithm for data recovery used in transactional systems based on the write-ahead logging (WAL) technique. It supports both full and partial transaction rollback, can work with multiple buffer manager policies (the buffer manager is the part of the database that maps volatile storage, e.g. RAM, to non-volatile storage, e.g. HDD) while also providing fine-grained locking, down to locking individual rows. ARIES achieves its goals by repeating the full history of changes. This design decision makes ARIES easy to understand while also allowing it to keep working even when multiple consecutive failures occur without a chance to fully recover in between.

ARIES, or at least its ideas, is used in most recovery algorithms of databases that store and update data in place, such as Oracle DBMS, Microsoft SQL Server and IBM DB2. A very similar approach is also used in PostgreSQL and some other relational databases. For a work published in 1992, staying strong in the twenty-first century is a hell of an achievement!

ACID and ARIES

I would guess that most of you have heard another abbreviation commonly used when describing databases: ACID. ACID stands for Atomicity, Consistency, Isolation and, lastly, Durability. While there is a lot of meaning in each of these words, the abbreviation itself does not map onto database components directly. For example, it's hard to find a component of a database that is strictly responsible for durability and has nothing to do with consistency. Databases generally consist of different parts and processes which, as a whole, make them what they are: a durable and effective way to store, find and update data.

ARIES (Algorithm for Recovery and Isolation Exploiting Semantics) is one of the most influential and dense algorithms for data recording and recovery. The paper (pdf) dates back to 1992, but similar algorithms have existed since roughly the 1960s-70s, when the first databases appeared. Ideas from ARIES are still used everywhere, and the paper itself has already been cited more than 1,500 times (according to Google Scholar)! Almost any other work about transactions or recovery mentions ARIES.

The authors discuss not only the problems of data recovery, but also the impact the logging algorithm has on the database's ability to provide granular locks (the less data one lock covers, the more data can be changed in parallel), how to implement partial rollback, what to do if the recovery process is itself interrupted, and how to accept new transactions while the old ones have not yet been recovered. The paper also compares ARIES with the solutions existing at the time and shows where they fall short. I will omit this last part, though it is also very interesting reading. By the way, due to the constant comparisons with other databases and algorithms, the paper is quite difficult to read, although the algorithm itself is fairly simple.

Database from a bird's-eye view. Working with memory and disk

To begin with, a few paragraphs about how databases work when viewed from very far away, but a little closer than when working with database clients in programming languages. Fundamentally, a database is exactly the same process, or group of processes, running on the operating system as our favorite web servers or phone applications. We will leave distributed databases aside; everything is the same there, only a billion times more complicated. First, about memory and disk in the database, and then about how the database prevents data loss with logging.

As in any other process, part of the database's data lives in volatile memory: RAM, or simply "memory". At the same time, the main task of the database is to save data to non-volatile storage: HDDs, SSDs, tape drives, depending on what you have at hand and what year it is. Non-volatile devices are usually much slower to read, and even more so to write, than memory. So slow that the database simply cannot translate read and write requests directly to disk. This creates a dilemma. On the one hand, the database must save everything to disk so that data is not lost. On the other hand, some of the data has to be kept in memory because the disks are too slow. And the information in memory disappears as soon as the power goes off or the database process dies.

The most straightforward way out is to keep a memory buffer in the database, periodically flushed one-to-one to disk. The solution is not ideal: since the capacity of disks is usually many times larger than the amount of memory, it is impossible to keep everything. In addition, writing and reading the entire buffer at once is expensive; you really can't sync 32+ GB of RAM in one go. But there is a way out: store only a certain number of pages in memory. Pages are usually made the same size, which makes them easier to work with. The underlying assumption is that all the data is most likely not needed at once, so you can keep only the part currently being worked on.

So the database stores a set of pages mapped one-to-one onto sections of the disk. And now for the trick. We pulled a page off the disk and keep it in memory. Since the only way to change the contents of the disk is to write a page, until that happens our page is an exact representation of the data on disk. And it is stored in memory. This means that any request for the data on disk can be served by this very page in memory, which is anywhere from several times to several orders of magnitude faster!

Now about writing. As long as the data on the page has not changed, i.e. as long as the page remains "clean", you don't even need to touch the disk, since the data is in memory. As soon as the data on the page changes, it becomes a "dirty page". Such a page needs to be written to disk so that it becomes clean again. A component called the buffer manager (hereinafter BM) is responsible for moving pages between disk and memory. By the way, if it seems to you that you have already heard something about pages and synchronization, that's right! Virtual memory in the OS works in a fairly similar way, although the OS does not have the task of fitting all the data onto disk, and it also has hardware support. Nevertheless, the approaches and even the terminology are very similar.
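As a toy illustration of clean and dirty pages (my own sketch, not a real buffer manager: the "disk" here is just a Python dict, and all the names are invented):

```python
PAGE_SIZE = 4096  # pages are fixed-size; 4 KiB or 8 KiB are common choices

class BufferManager:
    def __init__(self, disk):
        self.disk = disk    # page_id -> bytes, stands in for the disk
        self.cache = {}     # page_id -> bytearray, the in-memory copies
        self.dirty = set()  # page_ids whose memory copy differs from disk

    def get_page(self, page_id):
        # Read the page from "disk" into memory on first access.
        if page_id not in self.cache:
            self.cache[page_id] = bytearray(self.disk.get(page_id, bytes(PAGE_SIZE)))
        return self.cache[page_id]

    def write(self, page_id, offset, data):
        page = self.get_page(page_id)
        page[offset:offset + len(data)] = data
        self.dirty.add(page_id)  # the page no longer matches disk: it is dirty

    def flush(self, page_id):
        # Writing the page back to disk makes it clean again.
        self.disk[page_id] = bytes(self.cache[page_id])
        self.dirty.discard(page_id)
```

A real BM would also need an eviction policy for when the cache fills up, which is exactly where the policy questions in the next section come from.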

Database from a bird's-eye view. Buffer manager and write-ahead logging policies

A buffer manager in a transactional system can work in different modes, sometimes called policies. One of them is steal / no-steal. The essence is this: inside a transactional system, we execute... transactions. They change the contents of clean pages, which makes those pages dirty. A single transaction can change the contents of several pages at once. A transaction can also take a relatively long time, up to a commit or a full rollback. The question arises: can a dirty page be written to disk, making it clean, before all transactions using this page have completed? If it can, the BM follows the steal policy. If it cannot, i.e. only pages that are not currently being modified may be written, the no-steal policy is used.

The steal policy lets you throw out a bunch of synchronization primitives and increase concurrency, but with it you need to track which part of each transaction's updates has already been written to disk and which is still sitting in dirty pages.

Another important BM policy is force / no-force. With a force policy, a transaction cannot commit until all its changes are written to non-volatile storage. In other words, each transaction always syncs its changes to disk. The force policy greatly simplifies recovery: if a transaction managed to commit before the database failed, its data is definitely in storage. On the other hand, with a force policy you can forget about performance; syncing every change to disk is super expensive.

ARIES works with the most relaxed combination of policies, steal + no-force, and still provides "guaranteed" data retention. In quotes, because you cannot simply dismiss the myriad quirks of the OS, disks, SSDs and their drivers, because of which a write that looks guaranteed is not always so. Before moving on to ARIES, it remains to describe the write-ahead logging algorithm, or protocol, as the original paper calls it.

Write-ahead logging is used in systems where pages are updated in place. In other words, there is one copy of the page on disk and its clone in memory, as a clean or dirty page. No second versions anywhere else, just one copy. This approach imposes a restriction: with WAL, before a dirty page in memory is allowed to overwrite its page on disk, a log record describing the changes must first be saved to disk. First the log record with information about the update, then the update itself. That's why the approach is called write-ahead logging.
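The rule itself is small enough to sketch. In this toy model (all names are mine, not from any real system), each page remembers the LSN of the last log record that changed it, and flushing the page first forces the log up to that LSN:

```python
class WalBuffer:
    """A toy append-only log buffer; the LSN is the record's index."""
    def __init__(self):
        self.records = []
        self.flushed_lsn = -1  # highest LSN durably on disk, -1 = nothing yet

    def append(self, record):
        self.records.append(record)
        return len(self.records) - 1  # the new record's LSN

    def force(self, lsn):
        # Pretend to write all log records up to `lsn` to disk.
        self.flushed_lsn = max(self.flushed_lsn, lsn)

def flush_page(wal, page, disk):
    # The WAL protocol: log records describing the page's changes must reach
    # the disk before the page itself may overwrite its on-disk copy.
    if page["pageLSN"] > wal.flushed_lsn:
        wal.force(page["pageLSN"])
    disk[page["id"]] = dict(page)  # only now is the page write legal
```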

In general, WAL is the log everyone is familiar with. Like any log, it is append-only. Syncing such a log is also quite fast, since old data never changes. The word "syncing" is not used here by chance: the WAL is itself another buffer that lives in database memory and hypothetically might not survive to be written to disk. To prevent this, the database typically syncs the WAL on transaction commits and aborts. Other triggers for WAL synchronization are exceeding the buffer size or simply a timer. In Postgres, for example, the buffer size and the flush interval can even be configured.

Note also that the write-ahead logging protocol and the log itself are often both called, in the literature, by a single word: WAL. This usually does not lead to confusion, since the context makes clear whether we are talking about the protocol or the file. In short, the WAL protocol is the rule that a page update may reach the disk only after the corresponding log update has reached the same disk. And the WAL log is the file with transaction records.

WAL content in ARIES

Finally, we come to the most interesting part: how ARIES actually works. It works by repeating history. This means the WAL must contain enough information to repeat that history, so first we must talk about the pieces that make up the WAL and the auxiliary structures in the database. There will be a lot of names and terms now, but an example follows, so hold on. As a last resort, there is a link to a YouTube lecture at the end of the post; I strongly recommend it if questions remain.

Since WAL records are written sequentially, each record can be assigned a monotonically increasing number, called the log sequence number, or LSN. In some situations the LSN does not even need to be stored explicitly: since writing is sequential, the record's address on disk works as an LSN. This holds until the log no longer fits on the disk. To avoid that problem, with some hacks the disk can be treated as a circular buffer, with the LSN written as an address plus the number of wrap-arounds, but these are implementation details. One thing matters: each record in the WAL has an LSN, from which the history of database updates can be reconstructed.

In addition to the LSN, each record contains three more fields: prevLSN, TrId and type. PrevLSN points to the previous log record created by the transaction with identifier TrId. In other words, the records with the same TrId form a linked list, where the previous record can always be found via prevLSN.

Type indicates the type of the record. Different sources distinguish different numbers of types, but three will be enough for us. The type determines which other fields a WAL record contains.

The simplest type of record is "commit". It contains no extra fields and simply says that the transaction completed successfully.

Next comes the record responsible for changing or creating data: "update". In addition to prevLSN, TrId and type="update", each update contains three new fields. The first, pageId, is the page where the update takes place; the second and third are redoData and undoData. Redo is the change itself made as part of the transaction: it may be an increment of some value or a new value for a column. Undo is the reverse operation. If redo is an increment, undo is a decrement. If redo is a new value, undo contains the old one. It is not so important whether undo/redo holds a logical change or the resulting state after applying it; both approaches work. What matters is that a single update record in the WAL describes both how to apply the change and how to roll it back.

The last interesting type of record is the compensating log record, a.k.a. compensation log record, a.k.a. CLR. Such records are written by the database during rollbacks. For each record of type update, a full rollback produces one record of type CLR, applied in reverse order: first the newest update is rolled back, then the one before it, and so on. In addition to prevLSN, TrId and type="compensation", each CLR contains a pageId, which, as in update records, indicates the page number, and two more fields that I have named in a special way to make things clearer (I'm not sure it worked).

First field: redoTheUndoData. A CLR is used for rollbacks, i.e. it rolls the transaction back. The information about how to roll back is in the associated "update" record, in its undoData field. The CLR simply carries this undoData in its redoTheUndoData field. In other words, the CLR says which operation must be performed to roll back the associated record.

Second field: undoNextLSN. It points to the WAL record that must be rolled back after the current one. If only one record remains to be rolled back, it may contain null or some other analog of an empty value. Thus, several CLR records are not linked to each other directly, but indirectly via undoNextLSN.
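Putting the three record types together, here is a sketch of their shapes as Python dataclasses. The field names follow the text; this is an illustration, not a real on-disk format:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CommitRec:                 # type = "commit"
    lsn: int
    prev_lsn: Optional[int]      # previous record of the same transaction
    tr_id: int

@dataclass
class UpdateRec:                 # type = "update"
    lsn: int
    prev_lsn: Optional[int]
    tr_id: int
    page_id: str                 # page where the change happens
    redo_data: str               # how to apply the change, e.g. "k+=9"
    undo_data: str               # how to roll it back,   e.g. "k-=9"

@dataclass
class ClrRec:                    # type = "compensation"
    lsn: int
    prev_lsn: Optional[int]
    tr_id: int
    page_id: str
    redo_the_undo_data: str      # copied from the rolled-back record's undo_data
    undo_next_lsn: Optional[int] # next record to undo; None when rollback is done
```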

Example WAL, other data structures

It's time for an example. To keep things simple, I will omit working with pages and the pageId field and show how the records in the WAL are connected to each other.

Let there be a page X with two fields, k and n. There are also two parallel transactions: one increases k by 2, and the second decreases n by 3. Then the first transaction increases k again, by 9. So far neither has managed to commit. I will label each WAL record with a number equal to its LSN, but as you remember, the LSN doesn't have to be written explicitly; the record's address in the log works too. I'll start the log somewhere in the middle, so that nobody gets confused either. Instead of "update" I'll write just "u", "commit" becomes "c", and "compensation" becomes "clr".

Let me remind you that a record of type update contains the fields prevLSN, TrId, pageId (omitted), redoData and undoData:

77. [-, 1, "u", k+=2, k-=2]
78. [-, 2, "u", n-=3, n+=3]
79. [77, 1, "u", k+=9, k-=9]

The first two records contain "-" instead of the prevLSN field, as they are the first changes of their transactions. The third record (LSN = 79) contains 77 in its prevLSN field: a pointer to the previous record made by this transaction, i.e. the record with LSN = 77.

Now let transaction 2 commit, and let transaction 1 fully roll back its previous updates, then increase k by 13 and commit. The result is the following:

77. [-, 1, "u", k+=2, k-=2]
78. [-, 2, "u", n-=3, n+=3]
79. [77, 1, "u", k+=9, k-=9]
...
88. [78, 2, "c"]
89. [79, 1, "clr", k-=9, 77]
90. [89, 1, "clr", k-=2, -]
91. [90, 1, "u", k+=13, k-=13]
92. [91, 1, "c"]

Let's start with record 89. The redoTheUndoData field contains k-=9, exactly what the corresponding record 79 of type update contained in its undoData field. Indeed, when a CLR record is created, the data is copied here. The undoNextLSN field contains 77, i.e. a pointer to the next record to be rolled back as part of this rollback. It is copied from the prevLSN of the record being rolled back: the record with LSN 79 has 77 written in its prevLSN field.

If there were no data in undoNextLSN, it would mean the rollback is partial, i.e. only one record of the two is rolled back. In other words, the presence of a value in undoNextLSN means the rollback is not finished yet. When the rollback reaches the last record, the resulting CLR has no link in undoNextLSN. This exact case is shown in record 90: it is also a CLR record, but there is no record to roll back after it, so its undoNextLSN field contains "-".

In addition to the special way of filling the WAL, ARIES needs to maintain two tables alongside the log. They can be rebuilt from the WAL, so it is enough to keep them in memory.

The first is the dirty page table (DPT). It has only two fields: pageId, the page number, and a pointer to the first WAL record that made the page dirty. This field is called recoveryLSN, and it is used to find the record from which recovery should start. The table changes in two cases. If the next WAL record changed a page that was clean, the page number and the record's LSN are added here. If the page is synced to disk, its entry is deleted from the DPT.

The second structure needed for ARIES is the transaction table (TT). It contains all unfinished transactions with their TrId identifier, as well as the last change made within the transaction, lastLSN. Each WAL record has a pointer to the previous record in its prevLSN field, i.e. all records of a transaction are combined into a linked list. The TT stores the transaction and its last change, so an entry here is effectively a pointer to the head of that linked list. An entry in the TT is updated whenever a change occurs within the transaction, and deleted when the transaction commits or fully rolls back.
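A sketch of how the two tables evolve during normal operation, assuming dict-shaped records with the fields described above (the helper names are mine):

```python
def note_record(rec, dpt, tt):
    """Update DPT (pageId -> recoveryLSN) and TT (trId -> lastLSN)
    for one new WAL record during normal operation."""
    if rec["type"] in ("u", "clr"):
        # Only the FIRST record to dirty a page sets its recoveryLSN.
        dpt.setdefault(rec["pageId"], rec["lsn"])
        tt[rec["trId"]] = rec["lsn"]   # head of the transaction's linked list
    elif rec["type"] == "c":
        tt.pop(rec["trId"], None)      # committed: leaves the TT

def note_page_flush(page_id, dpt):
    # Once the page is synced to disk it is clean again.
    dpt.pop(page_id, None)
```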

The recovery algorithm. Analysis phase and redo

The recovery algorithm in ARIES consists of three phases: analysis, redo and undo. The phases are always executed in this order, even if recovery itself failed partway and has to be restarted.

We will walk through the phases using an example WAL that contains 4 records at the moment of recovery, right after a restart occurred. This time we add the pageId field to the records to show how the dirty page table DPT changes. Let there be two transactions, one working on page X and the other on page Y. The WAL looks like this:

30. [-, 1, X, "u", k+=2, k-=2]
31. [-, 2, Y, "u", n-=3, n+=3]
32. [30, 1, X, "u", k+=9, k-=9]
33. [31, 2, Y, "c"]
-- RESTART

There is transaction 2 (starting with record 31), which ended with a commit, and transaction 1, whose 2 updates got into the WAL but whose commit did not happen. Since transactions must be atomic, transaction 1 must be rolled back after recovery.

In the first phase, analysis, ARIES walks through the WAL records and rebuilds the state of the dirty page table DPT and the table of unfinished transactions TT.

For example, after analyzing the first two records, 30 and 31, the DPT contains:

X, 30
Y, 31

After analyzing all records up to 33, the DPT still contains:

X, 30
Y, 31

As you can see, the last two records changed nothing. The DPT is about dirty pages, and only the first record that made a page dirty matters to this table. Commits mean nothing to the DPT, since ARIES can work in no-force mode, i.e. a commit does not imply that the page has been written to disk.

Now to the transaction table. After analyzing records 30 and 31, the TT looks like this:

1, 30
2, 31

Nothing unexpected. The TT stores the latest change of each unfinished transaction. No transaction has completed yet, so there are two entries.

After analyzing the last two records, TT will turn into:

1, 32

As you can see, transaction 2 disappeared from the TT because it managed to commit. The last change of transaction 1 occurred in the record with LSN 32, so it is still in the TT.

The last thing that happens in the analysis phase is finding the LSN from which the redo phase should start. To do this, it is enough to take the minimum recoveryLSN over the entries of the dirty page table DPT. In our case, this is the record with LSN 30.
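The whole analysis phase over this example fits in a small script (the record layout follows the text; field names are mine):

```python
# The example WAL at the moment of the crash (records 30-33).
wal = [
    {"lsn": 30, "prevLSN": None, "trId": 1, "pageId": "X", "type": "u"},
    {"lsn": 31, "prevLSN": None, "trId": 2, "pageId": "Y", "type": "u"},
    {"lsn": 32, "prevLSN": 30,   "trId": 1, "pageId": "X", "type": "u"},
    {"lsn": 33, "prevLSN": 31,   "trId": 2, "pageId": None, "type": "c"},
]

dpt, tt = {}, {}
for rec in wal:
    if rec["type"] == "u":
        dpt.setdefault(rec["pageId"], rec["lsn"])  # first LSN to dirty the page
        tt[rec["trId"]] = rec["lsn"]               # transaction's latest record
    elif rec["type"] == "c":
        tt.pop(rec["trId"], None)                  # committed, leaves the TT

redo_start = min(dpt.values())  # where the redo phase will begin

assert dpt == {"X": 30, "Y": 31}
assert tt == {1: 32}
assert redo_start == 30
```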

Phase 2 is redo. During redo, the page buffers are brought to the state they were in before the restart. If the analysis only built the DPT and TT tables, the redo phase brings the in-memory pages into full agreement with the information in the WAL. In other words, dirty pages really become dirty in memory again. During redo, even updates of transactions that failed, like transaction 1 in our case, are re-applied.
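A sketch of the redo loop over the same example. Real ARIES also compares each page's on-disk pageLSN to skip changes the page already contains; this toy version omits that and simply replays everything starting from the dirty pages' recoveryLSNs. Here redo data is modeled as a (field, delta) pair:

```python
def redo(wal, dpt, pages):
    """Repeat history: re-apply every logged change a dirty page may be missing."""
    start = min(dpt.values())                 # found during the analysis phase
    for rec in wal:
        if rec["lsn"] < start or rec["type"] == "c":
            continue                          # commits carry no page changes
        pid = rec["pageId"]
        if pid not in dpt or rec["lsn"] < dpt[pid]:
            continue                          # page was clean, or change too old
        field, delta = rec["redo"]            # e.g. ("k", +9)
        page = pages.setdefault(pid, {})
        page[field] = page.get(field, 0) + delta  # even for losing transactions

pages = {}
redo([
    {"lsn": 30, "trId": 1, "pageId": "X", "type": "u", "redo": ("k", +2)},
    {"lsn": 31, "trId": 2, "pageId": "Y", "type": "u", "redo": ("n", -3)},
    {"lsn": 32, "trId": 1, "pageId": "X", "type": "u", "redo": ("k", +9)},
], {"X": 30, "Y": 31}, pages)

assert pages == {"X": {"k": 11}, "Y": {"n": -3}}
```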

The recovery algorithm. Undo phase

Finally, phase 3 is undo. After the first two phases, the table of unfinished transactions TT may be non-empty. These transactions did not manage to commit, so they must be rolled back to preserve atomicity.

During undo, transactions are rolled back in reverse chronological order relative to the order in which their records were written. To do this, while there are entries in the TT, the undo phase applies the following algorithm:

  1. The entry with the largest LSN is selected.
  2. If the record is an update, it is rolled back by writing a CLR record to the WAL, exactly the same record that is normally written during rollbacks. In effect, ARIES replaces unsuccessful transactions with successful ones that simply performed a rollback. The new CLR's LSN replaces the transaction's entry in the TT for further processing.
  3. If the record is a CLR, its undoNextLSN value replaces the entry instead. If that value is absent, the entry is deleted from the TT.
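The three steps above can be sketched as a loop (a toy model over dict-shaped records; field names such as redoTheUndo are mine, per the fields described earlier):

```python
def undo(wal, tt, next_lsn):
    """Roll back every transaction left in tt, appending CLRs to the wal.

    wal: dict LSN -> record; tt: trId -> lastLSN of an unfinished transaction.
    """
    while tt:
        # 1. Take the unfinished transaction whose entry has the largest LSN.
        tr_id, lsn = max(tt.items(), key=lambda kv: kv[1])
        rec = wal[lsn]
        if rec["type"] == "u":
            # 2. Roll the update back by writing a CLR; its undoNextLSN is
            #    copied from the update's prevLSN, redoTheUndo from its undo.
            wal[next_lsn] = {"lsn": next_lsn, "prevLSN": lsn, "trId": tr_id,
                             "type": "clr", "redoTheUndo": rec["undo"],
                             "undoNextLSN": rec["prevLSN"]}
            tt[tr_id] = next_lsn
            next_lsn += 1
        elif rec["undoNextLSN"] is None:
            del tt[tr_id]                   # 3a. This transaction is fully undone.
        else:
            tt[tr_id] = rec["undoNextLSN"]  # 3b. Jump to the next record to undo.

# The article's example: transaction 1 never committed, tt = {1: 32}.
wal = {
    30: {"lsn": 30, "prevLSN": None, "trId": 1, "type": "u", "undo": "k-=2"},
    31: {"lsn": 31, "prevLSN": None, "trId": 2, "type": "u", "undo": "n+=3"},
    32: {"lsn": 32, "prevLSN": 30,   "trId": 1, "type": "u", "undo": "k-=9"},
    33: {"lsn": 33, "prevLSN": 31,   "trId": 2, "type": "c"},
}
tt = {1: 32}
undo(wal, tt, 34)
```

Running this produces exactly the two CLRs (records 34 and 35) from the walkthrough below, and leaves the TT empty.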

Let's look at the example above. WAL still looks like this:

30. [-, 1, X, "u", k+=2, k-=2]
31. [-, 2, Y, "u", n-=3, n+=3]
32. [30, 1, X, "u", k+=9, k-=9]
33. [31, 2, Y, "c"]
-- RESTART

The undo phase looks at the TT. It sees the entry 1, 32 there and starts rolling the transaction back with a CLR record. After the rollback, the new state of the WAL will be:

30. [-, 1, X, "u", k+=2, k-=2]
31. [-, 2, Y, "u", n-=3, n+=3]
32. [30, 1, X, "u", k+=9, k-=9]
33. [31, 2, Y, "c"]
34. [32, 1, X, "clr", k-=9, 30]

And TT will be updated to

1, 34

Adding a new CLR record to the WAL follows the same write-ahead protocol as regular records. In other words, the page change it describes cannot reach the disk until the CLR itself is saved in the WAL log.

When record 32 is rolled back, we see that the transaction's previous record is record 30 (the prevLSN field). This value is copied into the new CLR's undoNextLSN. The undoData is copied into redoTheUndoData in the same way.

Now let there be one more restart at this very moment. ARIES promises that everything will keep working even with repeated restarts. Let's check!

30. [-, 1, X, "u", k+=2, k-=2]
31. [-, 2, Y, "u", n-=3, n+=3]
32. [30, 1, X, "u", k+=9, k-=9]
33. [31, 2, Y, "c"]
34. [32, 1, X, "clr", k-=9, 30]
-- RESTART 2

Let's run the analysis again. After analyzing the first 4 records (30-33), the state of TT and DPT is the same as it was at the end of the first restart:

DPT:

X, 30
Y, 31

TT:

1, 32

But the analysis is not finished yet, because there is also record number 34. After processing it, the DPT will not change, but the TT will become:

1, 34

As you can see, this is exactly the state the TT was in during the undo phase of the previous restart. ARIES has arrived at exactly the same state as before.

The redo phase will again bring the page buffers up to date. It remains only to finish the recovery in the undo phase. Again, we take the entry with the largest LSN: record 34. Step 3 of the algorithm says that if the record is a CLR, the value in the TT should be updated to its undoNextLSN. The new TT state:

1, 30

Again we look for the largest LSN in the TT and find 30. It is an update record, so it must be rolled back with a CLR. The record has no prevLSN, so this will be the last CLR record. Now the WAL state will be:

30. [-, 1, X, "u", k+=2, k-=2]
31. [-, 2, Y, "u", n-=3, n+=3]
32. [30, 1, X, "u", k+=9, k-=9]
33. [31, 2, Y, "c"]
34. [32, 1, X, "clr", k-=9, 30]
35. [34, 1, X, "clr", k-=2, -]

And in TT it will turn out:

1, 35

One last step remains: we pull the largest LSN, 35, out of the TT. It is a CLR, so, according to step 3, the value in the TT for transaction 1 must be updated to its undoNextLSN, but it contains "-". This means we have fully rolled back the transaction and the entry can be deleted. Finally, there are no more entries in the TT. The database state has been successfully restored.

Conclusion

ARIES is an efficient and fairly simple logging and recovery algorithm. It works by fully repeating history. As the example showed, it keeps working even across multiple restarts in a row. ARIES can work in the steal + no-force buffer manager mode; in other words, it does not require dirty pages to be written to disk on commit, and it also allows pages to be written even mid-transaction.

ARIES performs recovery in three steps.

First, the WAL records are analyzed. The analysis rebuilds the state of the dirty page table DPT, as well as the unfinished transaction table TT.

In the next phase, redo, the state of the page buffers in memory is brought into full agreement with the WAL. After redo, the internal state of the pages is identical to what it was just before the restart.

Finally, in the undo phase, transactions that did not manage to commit before the restart are rolled back. During undo, each update record of an unfinished transaction is compensated, in reverse chronological order, by a CLR record, exactly the same record that is written to the WAL during rollbacks in normal operation. This approach both simplifies recovery across repeated restarts and bounds the number of new WAL records created during recovery. Creating a CLR follows the same write-ahead protocol as normal writes in normal mode.

I have already exceeded all imaginable and unimaginable limits on the size of a post, so I think I'll wrap up. There are a couple more things I haven't told you about.

The first is checkpoints. The database can periodically take the current state of the dirty page table DPT and the unfinished transaction table TT and write it to some reliable place. This allows the ARIES analysis phase not to scan the entire log: it restores the checkpoint first and rolls forward from it.

Second, each phase of ARIES can not only survive a restart, but can also be run safely in parallel across several threads. This is achieved through page-level locks. Moreover, the database can take checkpoints during each phase. For example, a checkpoint after the analysis phase means the analysis need not be redone if a second restart happens right away.

Third, I said that ARIES allows granular locks, i.e. locking both at the page level and at the level of a single row, but I did not say how. Locks in databases are hard in general; someday I will write a separate post about lock managers.

Fourth, there is a distributed ARIES! D-ARIES: A Distributed Version of the ARIES Recovery Algorithm (link). Also, ARIES (or rather WAL) is optimized for hard drives. SSDs need sequential writes to a lesser extent, so they can be used to build a more efficient algorithm. An example of such an algorithm is in the paper From ARIES to MARS: transaction support for next-generation, solid-state drives (link). All of this is in the backlog too.

Sources:

[1] ARIES whitepaper: dl.acm.org/doi/10.1145/128765.128770
[2] A beautiful YouTube lecture about ARIES by Dr. Jens Dittrich: youtube.com/watch?v=S9nctHdkggk (part of a whole university course of video lectures about databases)
[3] Ramakrishnan, Gehrke - Database Management Systems, Third Edition (link)
[4] Peter Bailis, Joseph M. Hellerstein, Michael Stonebraker - Readings in Database Systems, Fifth Edition, a.k.a. the Red Book (link)