Exonum MerkleDB¶
Exonum MerkleDB is a persistent storage implementation based on RocksDB. MerkleDB provides APIs to work with merkelized data structures.
MerkleDB is an object database. Objects represent the highest abstraction level for the data storage. All objects fall into the following groups:
- Blobs which represent sequences of bytes
- Indexes have a unique identifier consisting of a UTF-8 string and an optional byte prefix. Indexes store blobs within them.
The rest of the article describes the following:
- Index types lists the types of data stores supported by MerkleDB.
- Low-level storage explains how objects are persisted using RocksDB.
- View layer describes the wrapper over the DB engine that ensures atomicity of database changes.
- Indexing explains how indices over structured data can be built in MerkleDB.
- State aggregation describes how entire database state is represented by an automatically updated single hash digest.
- Migrations explains tooling for atomic, fault-tolerant migrations within the database.
Index Types¶
Indexes perform the same role as tables in relational database management systems (RDBMSs). All MerkleDB indexes internally are implemented as wrappers around key-value stores.
Indexes in MerkleDB fall into several types:
- list of items
- map, where key-value pairs are stored
- set of unique items
- entry, which represents an optional single item.
Both keys and values in the wrapped stores are persisted as byte sequences. MerkleDB does not natively support operations (matching, grouping, sorting, etc.) over separate value fields, as it is the case with other key-value storages.
MapIndex¶
MapIndex
implements a key-value store aka a map. It has
the following functionality:
- Get, set and remove value by key
- Check if a specific key is present in the map
- Iterate over the key-value pairs in the lexicographic key order
- Iterate over keys or values in the lexicographic key order
- Clear the map (i.e., remove all stored key-value pairs)
- Get the total number of items in the index.
ListIndex¶
ListIndex
represents an array list. The following operations are supported:
- Get and set a list item by an index
- Append an item to the list
- Pop or poll the last item from the list
- Get the list length
- Check if the list is empty
- Iterate over index-item pairs ordered by indices
- Extend the list by adding a sequence of items from an iterator to the end of the list
- Truncate the list to the specified length
- Clear the list (i.e., remove all stored items from the list)
- Get the total number of items in the list.
ListIndex
does not support inserting items in the middle of the
list or removing items by the index
(although it is still possible to implement these operations manually).
Implementation Details
To support proper iteration, 8-byte unsigned indices precede ListIndex
items. Indices are implicitly defined by the items order. Indices are
serialized in the big-endian form.
SparseListIndex¶
SparseListIndex
represents a ListIndex
that may
contain "gaps". It provides the possibility to delete items not only from
the end of the list, but from any part thereof. Such deletions do not break
the order of the indices inside the list.
The remaining functionality of the SparseListIndex
is the same as for
ListIndex
. As in ListIndex
, extension of the list is
possible only by adding a sequence of items from an iterator to the end of the
list.
ValueSetIndex¶
ValueSetIndex
represents a hash set. The following operations are implemented:
- Add and remove set items
- Check if an item is already present using the item itself or its hash
- Iterate over stored items in the lexicographic order of their hashes
- Iterate over hashes of items in the lexicographic order
- Clear the set (i.e., remove all items)
- Get the total number of items in the index.
Hashes used in ValueSetIndex
are calculated with the object_hash()
method
of the ObjectHash
trait.
Implementation Details
Internally, ValueSetIndex
uses hashes of items as keys,
and items themselves as corresponding values.
KeySetIndex¶
KeySetIndex
represents a set. The following procedures are implemented:
- Add and remove set items
- Check if a specific item is in the set
- Iterate over items in the lexicographic order of their binary representation
- Clear the set (i.e., remove all stored items)
- Get the total number of items in the index.
Implementation Details
Internally, the item is used as a key, and its value is always empty.
KeySetIndex vs ValueSetIndex¶
While ValueSetIndex
uses a hash as a key, KeySetIndex
puts an entire binary
serialization of an item into the key.
KeySetIndex
does not have an additional overhead on hashing set items.KeySetIndex
should not be used when set items are relatively big; only small items should be stored in it (such as integers, small strings, small tuples). On the other hand,ValueSetIndex
handles storing big and complex items more easily.KeySetIndex
introduces a lexicographical order over stored items. InValueSetIndex
items are ordered according to their hash function properties.
Entry¶
Entry
represents an optional single item (i.e., Option<T>
in Rust terms).
The following operations are implemented:
- Get, set and remove the value
- Check if the value is present.
Merkelized Indexes¶
Merkelized indexes represent a list and a map with additional features. Such indexes can create proofs of existence or absence for stored data items.
When a light client requests data from a full node, the proof can be built and sent along with the actual data. Having block headers and this proof, the client may check that the received data was really authorized by the validators without having to replicate the entire blockchain contents.
ProofListIndex¶
Tip
See Merkelized List for more technical
details on ProofListIndex
and related proofs.
ProofListIndex
implements a Merkle tree, which is a merkelized version of an
array list. It implements the same methods as ListIndex
except for
truncation of the list. ProofListIndex
adds an
additional feature: based on Merkle trees, ProofListIndex
allows efficiently
creating compact proofs of existence for the list items.
The following additional procedures are implemented:
- Get the height of the Merkle tree. As the tree is balanced (though may be not
full), its height is close to
log2
of the list length - Get the value of the tree root (i.e., the hash of the entire Merkle tree)
- Build a proof of existence/absence for an item at a specific position
- Build a proof of existence/absence for items at a specific contiguous list range.
Implementation Details
As with ListIndex
, list items are stored with 8-byte keys. However,
ProofListIndex
also persists all intermediate nodes of the Merkle tree
built on top of the list, in order to quickly build proofs and recalculate
the Merkle tree after operations on the list.
ProofMapIndex¶
Tip
See Merkelized Map for more technical
details on ProofMapIndex
and related proofs.
ProofMapIndex
is a merkelized version of a map based on a binary
Merkle Patricia tree.
It implements the same methods as MapIndex
. It is also able to
create proofs of existence for its key-value pairs, or proofs of absence
if a key is absent in the map. The following additional
procedures are supported:
- Get the value of the root node
- Build a proof for the requested key. The tree proves either key existence (and its value), or key absence.
ProofEntry¶
ProofEntry
is a merkelized version of Entry
.
It has the same functionality as its ordinary counterpart.
Low-level Storage¶
MerkleDB uses a third-party database engine to persist blockchain state locally. Currently the main database engine is RocksDB. It is also possible to plug in other engines.
To use a particular database, a minimal Database
interface should be implemented for it:
- Get a value by a column family name and a key
- Put a new value at the specified column family / key (insert or update the saved one)
- Delete a key-value pair by a column family name / key.
All the index functionality is reduced to these atomic call types. Values of items of different indexes may be stored in a single column family in the low-level storage. Their keys are mapped to the low-level storage keys in a deterministic manner using index addresses.
Index Addresses¶
On user level every index is uniquely identified by its full address. An index address consists of 2 parts:
- String name that may contain uppercase and lowercase Latin letters,
digits, underscores
_
, hyphens-
and periods.
. Index names in services start with the service name and a period. For example, the only index in the Cryptocurrency Tutorial is namedcryptocurrency.wallets
, wherecryptocurrency
is the service name, andwallets
is the own name of the index - Optional prefix presented as a sequence of bytes (
Vec<u8>
in Rust terms). Prefixes allow to group logically related indexes.
Index Groups¶
Index groups – indexes with the same string name but different address prefixes – can be used to store collections keyed by an identifier. For example, a group of list indexes may be used to store transaction histories for wallets; in this case, an identifier could be the public key of the wallet.
Any index in the group can be accessed using a usual API. It is also possible to iterate over prefixes in a group.
Key Sorting and Iterators¶
MerkleDB indexes support iterating over index contents:
- items for lists and sets
- keys, values or key-value pairs in case of map indexes.
Such iterators use key ordering of the low-level key-value storage to determine the iteration order. Namely, keys are lexicographically ordered according to their binary serializations.
View Layer¶
Exonum introduces additional layer over the database to handle transaction and block atomicity.
Patches¶
Patch is a set of serial changes that should be applied to the low-level storage atomically. A patch may include two types of operations: put a value addressed by a key, or delete a value by a key.
Snapshots¶
Snapshot fixes the storage state at the moment of snapshot creation and provides a read-only API to it. Even if the storage state is updated, the snapshot still refers to the old content of the stored indexes.
Forks¶
Forks implement the same interfaces as the database underneath, transparently wrapping the real data storage state, and some additional changes. Every fork is based on the storage snapshot. From the outer point of view, the changes are eagerly applied to the data storage; however, these changes are stored directly in the fork and may be easily rolled back. You can create multiple mutable indexes from one shared reference to the fork. Moreover, there may be different forks of the same database snapshot.
Forks are used during transaction and block processing. A fork is successively passed to each transaction in the block to accumulate changes produced by the transactions. If one of the transactions in the block quits with an unhandled exception during execution, its changes are promptly rolled back, so that execution of the following transactions continues normally.
Accesses¶
Snapshots, forks and patches are raw; they provide access to the entire database. On top of these objects, MerkleDB provides tools for fine-grained access control:
Prefixed
access is used to restrict the user to a single namespace. The namespace is a UTF-8 string, which is prepended to the user-provided address name together with a dot.
. For example, an index with the address"bar"
in the namespace"foo"
is mapped to the index with the full address"foo.bar"
.Migration
represents data created during a migration.Migration
s have a namespace, but unlikePrefixed
, their data cannot be accessed in any other way.Scratchpad
hosts temporary data during a migration. Like other accesses, scratchpads are namespace-separated and likeMigration
, the access from a scratchpad is unique.
Example
Namespace test
concerns indexes with an address starting with
test.
, such as test.foo
or (test.bar, 1_u32)
, but not test
or test_.foo
.
Indexing¶
Unlike relational databases, Exonum does not support indexes over fields of blobs as a first-class entity. However, it is possible to create auxiliary indexes with indexing semantics. Content of the auxiliary indexes should be updated according to the content of the original objects that they index.
State Aggregation¶
MerkleDB automatically aggregates its contents into a single state hash, which commits to the entire Merkelized database contents. This is used in Exonum to achieve consensus as to the database state without requiring a single line of code from the service developers.
The state hash of the database is the hash of the state aggregator,
a ProofMapIndex
with keys being UTF-8 names of aggregated indexes,
and values their hashes. An index is aggregated if and only if it satisfies
the following constraints:
- Index has a matching type (
ProofListIndex
,ProofMapIndex
, orProofEntry
) - Index is not a part of a group, i.e., its address has an empty prefix
Snapshot and patches are always consistent with respect to the aggregated state; the index hashes in the state aggregator match their actual values. This is not the case for forks, in which the state aggregator may be stale.
Migration accesses have their separate aggregators that can be obtained via corresponding API.
Migrations¶
Migration refers to the ability to update data in indexes, remove indexes, change index type, create new indexes, and package these changes in a way that they can be atomically committed or rolled back. Accumulating changes in the migration, on the other hand, can be performed iteratively, including after a process shutdown.
Each migration is confined to a namespace, defined in a similar way as namespaces for fine-grained accesses.
Migration is non-destructive, i.e., does not remove the old versions
of migrated indexes. Instead, new indexes are created in a separate namespace,
which can be accessed via Migration
.
For example, index foo
in the migration namespace test
and the original
test.foo
index can peacefully coexist and have separate data and even
different types. The movement of data is performed only when the migration
is flushed. A migration can also store temporary data in a Scratchpad
.
Indexes created within a migration are not aggregated
in the default state hash. Instead, they are placed in a separate namespace,
the aggregator and state hash for which can be obtained via Migration
APIs.
In Exonum, this is used to ensure that data migration has the same outcome for
all nodes in the blockchain network.
Once a migration logic has completed, the migration can be either flushed or
rolled back. Flushing will replace old index data with new, remove indexes
marked with tombstones, and return migrated indexes to the default state aggregator.
Rolling back will simply remove all data in a Migration
. Both flushing and
rolling back also clear the scratchpad associated with the migration.