Exonum MerkleDB is a persistent storage implementation based on RocksDB. MerkleDB provides APIs to work with merkelized data structures.
MerkleDB is an object database. Objects represent the highest abstraction level for the data storage. All objects fall into the following groups:
- Blobs which represent sequences of bytes
- Indexes have a unique identifier consisting of a UTF-8 string and an optional byte prefix. Indexes store blobs within them.
The rest of the article describes the following:
- Index types lists the types of data stores supported by MerkleDB.
- Low-level storage explains how objects are persisted using RocksDB.
- View layer describes the wrapper over the DB engine that ensures atomicity of database changes.
- Indexing explains how indices over structured data can be built in MerkleDB.
- State aggregation describes how entire database state is represented by an automatically updated single hash digest.
- Migrations explains tooling for atomic, fault-tolerant migrations within the database.
Indexes perform the same role as tables in relational database management systems (RDBMSs). All MerkleDB indexes internally are implemented as wrappers around key-value stores.
Indexes in MerkleDB fall into several types:
- list of items
- map, where key-value pairs are stored
- set of unique items
- entry, which represents an optional single item.
Both keys and values in the wrapped stores are persisted as byte sequences. MerkleDB does not natively support operations (matching, grouping, sorting, etc.) over separate value fields, as it is the case with other key-value storages.
MapIndex implements a key-value store aka a map. It has
the following functionality:
- Get, set and remove value by key
- Check if a specific key is present in the map
- Iterate over the key-value pairs in the lexicographic key order
- Iterate over keys or values in the lexicographic key order
- Clear the map (i.e., remove all stored key-value pairs)
- Get the total number of items in the index.
ListIndex represents an array list. The following operations are supported:
- Get and set a list item by an index
- Append an item to the list
- Pop or poll the last item from the list
- Get the list length
- Check if the list is empty
- Iterate over index-item pairs ordered by indices
- Extend the list by adding a sequence of items from an iterator to the end of the list
- Truncate the list to the specified length
- Clear the list (i.e., remove all stored items from the list)
- Get the total number of items in the list.
ListIndex does not support inserting items in the middle of the
list or removing items by the index
(although it is still possible to implement these operations manually).
To support proper iteration, 8-byte unsigned indices precede
items. Indices are implicitly defined by the items order. Indices are
serialized in the big-endian form.
SparseListIndex represents a
ListIndex that may
contain "gaps". It provides the possibility to delete items not only from
the end of the list, but from any part thereof. Such deletions do not break
the order of the indices inside the list.
The remaining functionality of the
SparseListIndex is the same as for
ListIndex. As in
ListIndex, extension of the list is
possible only by adding a sequence of items from an iterator to the end of the
ValueSetIndex represents a hash set. The following operations are implemented:
- Add and remove set items
- Check if an item is already present using the item itself or its hash
- Iterate over stored items in the lexicographic order of their hashes
- Iterate over hashes of items in the lexicographic order
- Clear the set (i.e., remove all items)
- Get the total number of items in the index.
Hashes used in
ValueSetIndex are calculated with the
ValueSetIndex uses hashes of items as keys,
and items themselves as corresponding values.
KeySetIndex represents a set. The following procedures are implemented:
- Add and remove set items
- Check if a specific item is in the set
- Iterate over items in the lexicographic order of their binary representation
- Clear the set (i.e., remove all stored items)
- Get the total number of items in the index.
Internally, the item is used as a key, and its value is always empty.
KeySetIndex vs ValueSetIndex¶
ValueSetIndex uses a hash as a key,
KeySetIndex puts an entire binary
serialization of an item into the key.
KeySetIndexdoes not have an additional overhead on hashing set items.
KeySetIndexshould not be used when set items are relatively big; only small items should be stored in it (such as integers, small strings, small tuples). On the other hand,
ValueSetIndexhandles storing big and complex items more easily.
KeySetIndexintroduces a lexicographical order over stored items. In
ValueSetIndexitems are ordered according to their hash function properties.
Entry represents an optional single item (i.e.,
Option<T> in Rust terms).
The following operations are implemented:
- Get, set and remove the value
- Check if the value is present.
Merkelized indexes represent a list and a map with additional features. Such indexes can create proofs of existence or absence for stored data items.
When a light client requests data from a full node, the proof can be built and sent along with the actual data. Having block headers and this proof, the client may check that the received data was really authorized by the validators without having to replicate the entire blockchain contents.
See Merkelized List for more technical
ProofListIndex and related proofs.
ProofListIndex implements a Merkle tree, which is a merkelized version of an
array list. It implements the same methods as
ListIndex except for
truncation of the list.
ProofListIndex adds an
additional feature: based on Merkle trees,
ProofListIndex allows efficiently
creating compact proofs of existence for the list items.
The following additional procedures are implemented:
- Get the height of the Merkle tree. As the tree is balanced (though may be not
full), its height is close to
log2of the list length
- Get the value of the tree root (i.e., the hash of the entire Merkle tree)
- Build a proof of existence/absence for an item at a specific position
- Build a proof of existence/absence for items at a specific contiguous list range.
ListIndex, list items are stored with 8-byte keys. However,
ProofListIndex also persists all intermediate nodes of the Merkle tree
built on top of the list, in order to quickly build proofs and recalculate
the Merkle tree after operations on the list.
See Merkelized Map for more technical
ProofMapIndex and related proofs.
ProofMapIndex is a merkelized version of a map based on a binary
Merkle Patricia tree.
It implements the same methods as
MapIndex. It is also able to
create proofs of existence for its key-value pairs, or proofs of absence
if a key is absent in the map. The following additional
procedures are supported:
- Get the value of the root node
- Build a proof for the requested key. The tree proves either key existence (and its value), or key absence.
ProofEntry is a merkelized version of
It has the same functionality as its ordinary counterpart.
MerkleDB uses a third-party database engine to persist blockchain state locally. Currently the main database engine is RocksDB. It is also possible to plug in other engines.
To use a particular database, a minimal
interface should be implemented for it:
- Get a value by a column family name and a key
- Put a new value at the specified column family / key (insert or update the saved one)
- Delete a key-value pair by a column family name / key.
All the index functionality is reduced to these atomic call types. Values of items of different indexes may be stored in a single column family in the low-level storage. Their keys are mapped to the low-level storage keys in a deterministic manner using index addresses.
On user level every index is uniquely identified by its full address. An index address consists of 2 parts:
- String name that may contain uppercase and lowercase Latin letters,
.. Index names in services start with the service name and a period. For example, the only index in the Cryptocurrency Tutorial is named
cryptocurrencyis the service name, and
walletsis the own name of the index
- Optional prefix presented as a sequence of bytes (
Vec<u8>in Rust terms). Prefixes allow to group logically related indexes.
Index groups – indexes with the same string name but different address prefixes – can be used to store collections keyed by an identifier. For example, a group of list indexes may be used to store transaction histories for wallets; in this case, an identifier could be the public key of the wallet.
Any index in the group can be accessed using a usual API. It is also possible to iterate over prefixes in a group.
Key Sorting and Iterators¶
MerkleDB indexes support iterating over index contents:
- items for lists and sets
- keys, values or key-value pairs in case of map indexes.
Such iterators use key ordering of the low-level key-value storage to determine the iteration order. Namely, keys are lexicographically ordered according to their binary serializations.
Exonum introduces additional layer over the database to handle transaction and block atomicity.
Patch is a set of serial changes that should be applied to the low-level storage atomically. A patch may include two types of operations: put a value addressed by a key, or delete a value by a key.
Snapshot fixes the storage state at the moment of snapshot creation and provides a read-only API to it. Even if the storage state is updated, the snapshot still refers to the old content of the stored indexes.
Forks implement the same interfaces as the database underneath, transparently wrapping the real data storage state, and some additional changes. Every fork is based on the storage snapshot. From the outer point of view, the changes are eagerly applied to the data storage; however, these changes are stored directly in the fork and may be easily rolled back. You can create multiple mutable indexes from one shared reference to the fork. Moreover, there may be different forks of the same database snapshot.
Forks are used during transaction and block processing. A fork is successively passed to each transaction in the block to accumulate changes produced by the transactions. If one of the transactions in the block quits with an unhandled exception during execution, its changes are promptly rolled back, so that execution of the following transactions continues normally.
Snapshots, forks and patches are raw; they provide access to the entire database. On top of these objects, MerkleDB provides tools for fine-grained access control:
Prefixedaccess is used to restrict the user to a single namespace. The namespace is a UTF-8 string, which is prepended to the user-provided address name together with a dot
.. For example, an index with the address
"bar"in the namespace
"foo"is mapped to the index with the full address
Migrationrepresents data created during a migration.
Migrations have a namespace, but unlike
Prefixed, their data cannot be accessed in any other way.
Scratchpadhosts temporary data during a migration. Like other accesses, scratchpads are namespace-separated and like
Migration, the access from a scratchpad is unique.
test concerns indexes with an address starting with
test., such as
(test.bar, 1_u32), but not
Unlike relational databases, Exonum does not support indexes over fields of blobs as a first-class entity. However, it is possible to create auxiliary indexes with indexing semantics. Content of the auxiliary indexes should be updated according to the content of the original objects that they index.
MerkleDB automatically aggregates its contents into a single state hash, which commits to the entire Merkelized database contents. This is used in Exonum to achieve consensus as to the database state without requiring a single line of code from the service developers.
The state hash of the database is the hash of the state aggregator,
ProofMapIndex with keys being UTF-8 names of aggregated indexes,
and values their hashes. An index is aggregated if and only if it satisfies
the following constraints:
- Index has a matching type (
- Index is not a part of a group, i.e., its address has an empty prefix
Snapshot and patches are always consistent with respect to the aggregated state; the index hashes in the state aggregator match their actual values. This is not the case for forks, in which the state aggregator may be stale.
Migration accesses have their separate aggregators that can be obtained via corresponding API.
Migration refers to the ability to update data in indexes, remove indexes, change index type, create new indexes, and package these changes in a way that they can be atomically committed or rolled back. Accumulating changes in the migration, on the other hand, can be performed iteratively, including after a process shutdown.
Each migration is confined to a namespace, defined in a similar way as namespaces for fine-grained accesses.
Migration is non-destructive, i.e., does not remove the old versions
of migrated indexes. Instead, new indexes are created in a separate namespace,
which can be accessed via
For example, index
foo in the migration namespace
test and the original
test.foo index can peacefully coexist and have separate data and even
different types. The movement of data is performed only when the migration
is flushed. A migration can also store temporary data in a
Indexes created within a migration are not aggregated
in the default state hash. Instead, they are placed in a separate namespace,
the aggregator and state hash for which can be obtained via
In Exonum, this is used to ensure that data migration has the same outcome for
all nodes in the blockchain network.
Once a migration logic has completed, the migration can be either flushed or
rolled back. Flushing will replace old index data with new, remove indexes
marked with tombstones, and return migrated indexes to the default state aggregator.
Rolling back will simply remove all data in a
Migration. Both flushing and
rolling back also clear the scratchpad associated with the migration.