Sparse Merkle Trees and the New State Model

This blog post is part of the interoperability blog post series taking a closer look at various aspects of the Lisk interoperability solution. In this blog post, we introduce one of the cornerstones of Lisk-interoperability: The state model of an interoperable chain. To do so, we will first talk about sparse Merkle trees, introduced to the Lisk protocol in LIP 0039. Sparse Merkle trees are the fancy relatives of the regular Merkle trees. If you are not familiar with Merkle trees, we suggest you first read my previous blog post, as it will give you an introduction to many concepts we will discuss here. Next, I will present the state model, the state tree, and the new store interface, focusing finally on what will change from a developer perspective when interoperability is implemented into the new SDK.

Last month at Lisk.js 2021, the research team fully unveiled and presented the Lisk interoperability solution. Lisk.js is an annual event where developers can learn how to develop a blockchain application using the Lisk SDK and get the latest news about the Lisk protocol. Shortly after Lisk.js 2021, all core interoperability LIPs were published to the research forum.

# Sparse Merkle Trees Crash Course

A sparse Merkle tree (SMT) is an authenticated data structure organized as a tree, built on top of a key-value map. A key-value map is a collection of (key, value) pairs such that each key appears at most once. It supports the following operations:

**Look up**: Returns the value associated with a certain key.**Insert**: Inserts a certain key-value pair in the collection.**Update**: Updates the value associated with a certain key.**Delete**: Removes a certain key-value pair in the collection.

Consequently, the SMT also supports these operations, as explained in the following section. The SMT is constructed from the key-value map as follows. **Each possible entry of the key-value map corresponds to a leaf node of the tree**. Each leaf node has a key and value (corresponding to the entry key and value) as well as a hash property, obtained by hashing together the key and the value. If an entry is not present in the map, we just assign an empty leaf node to it (a node with an empty value and a constant hash). The position of the leaf node in the tree is fixed by its key. For instance, for a key length of 5 bits, the first leaf node has key 00000, the next ones have 00001, 00010, and so forth.

Parent branch nodes are obtained by hashing together two child nodes recursively, until at the top we are left with a single node, the Merkle root of the tree. Consequently, each branch node has a left child, right child, and hash property, obtained by hashing together the child nodes hash values. As a technical sidenote, we use different hash functions for leaf and branch nodes to prevent second image attacks (see section "Leaf versus Branch Hash" blog post).

*Figure 1: A complete sparse Merkle tree. Each possible key in the underlying key-value map has a corresponding leaf node in the base layer. Branch nodes (yellow boxes) are obtained by hashing recursively pairs of leaf nodes (green boxes), until a single node, the Merkle root (blue box), is reached. If the key length is 32 bytes, the tree has 256 layers (here elided for clarity) and 2256 leaf nodes.*

This is very reminiscent of regular Merkle trees, introduced in LIP 0031, with a very important difference: The order of insertion of data blocks does not matter, and the root of the tree only depends on the final state of the map. In cryptography, this means that a SMT is a **cryptographic accumulator** [1]. An accumulator is a hash function that possesses this history independence (quasi-commutative property) and allows for efficient **inclusion proofs**: a Prover is able to convince a Verifier that a certain element is present in the underlying data structure without revealing any other element, and the proof size scales logarithmically with the number of data blocks. In a SMT, the input of the hash function is the underlying key-value map and the output is the Merkle root. Due to the history independence, SMTs are the suitable choice to authenticate key-value maps.

On the other hand, a regular Merkle tree is an example of a **vector commitment** [2], because the underlying data structure is an array. This means that the Merkle root also authenticates the order of the data blocks in the array. This property can actually be very useful, e.g. to authenticate the transactions included in a block or the order of cross-chain messages in the Lisk interoperability solution, and in general if we are working with arrays. However, inserting a new data block at a specific position cannot be done efficiently, as the whole tree would need to be recomputed. Due to this fact, regular Merkle trees are not suitable to authenticate key-value maps but they are the suitable choice to authenticate arrays, where new data blocks are only appended.

*Figure 2: Comparison between a RMT (left) and a SMT (right). In the RMT, the underlying data structure is an array of data blocks, in this example the transactions inserted in a block. In a SMT, the underlying data structure is a general key-value map, in this example with only some non-empty key-value pairs. The RMT is suitable to work with an array, and the Merkle root (blue box) also authenticates the insertion order. A SMT is suitable to work with key-value maps, and the Merkle root does not depend on the insertion order.*

SMTs also have another nice property, in that they support efficient **non-inclusion proofs**: The Prover can convince the Verifier that a certain entry is *not* present in the map, and the proof size scales logarithmically with the number of data blocks. Because of this property, SMTs satisfy the definition of **universal accumulators** [4].

This tool can be useful in a variety of situations: Imagine the government is maintaining a central register of everyone that tested positive to Covid in the last month. Each person is assigned an ID which is the combination of their national ID number and the current month. The central register is then organized in a SMT, and to enter a public event, everyone has to show with a non-inclusion proof that their ID is not part of the central register.

However, there is a catch: As mentioned before, the SMT contains one node for each possible key, which for 32 bytes (256 bits) keys gives 2256 leaf nodes and 256 layers. Notice that in any realistic scenario, the vast majority of these leaf nodes would be empty, with no associated entry in the underlying map (hence the name *sparse* Merkle tree). Storing and manipulating such an astronomical number of nodes is impossible. Therefore we rely on the following optimizations:

- Each subtree with exactly one non-empty leaf is replaced by the leaf itself.
- Each subtree containing only empty nodes is replaced by an empty node, a constant node with a fixed hash value. The resulting construction is similar to Jellyfish Merkle trees proposed for the Diem protocol [5]. If the keys are distributed randomly, we obtain a tree with a number of layers that scale as Log(N), where N is the number of entries in the map. This guarantees that operations on the tree take on average Log(N) steps and (non-) inclusion proofs have on average size Log(N)

*Figure 3: We apply two optimizations to reduce the number of nodes in the tree: 1) Any subtree with exactly 1 non-empty node (green box) is replaced by the node itself. 2) Any subtree with only empty nodes (white boxes) is replaced by an empty node. Yellow boxes represent branch nodes.*

## Operations on the Tree

A crucial property of a SMT is the ability of adding, removing, and updating elements of the tree, for instance when the underlying map is updated. For our purposes, a key not present in the map can be considered as if it was paired with an empty value. In this sense, inserting can be seen as an update operation on a previously empty value, and deleting as an update operation to set a value to empty.

An accumulator supporting these operations is called a **dynamical accumulator** [3], although strictly speaking in the literature this property requires a cost constant with the number of accumulated elements, while we make do with logarithmic scaling.

### Update

Whenever we want to update a leaf node, the general strategy is to start from the root and navigate the tree downwards. The key of the new leaf is used to decide whether we should move to the left or right child, by looking at the binary representation of the key and moving to the left if the next digit is a 0 or to the right if it is a 1. We repeat this process until a leaf or empty node is encountered. If the final node has the same key of the new leaf, we simply update its value. Otherwise, if it is an empty node, we replace it with the new leaf.

*Figure 4: To add a new leaf node to the tree (dashed red box, leaf key = 5), we navigate from the root to the new leaf position using the digits of the binary representation of the leaf key. When an empty node is encountered (white box), we replace it with the new leaf.*

If instead the final node is a leaf node, we create new branch nodes until the keys of the new node and the final node diverge, and we add these nodes as children to the last branch node. Empty nodes are inserted as children of the new branch nodes as necessary. Finally we update all the branch nodes that were visited in this process.

*Figure 5: To add a new leaf node to the tree (dashed red box, leaf key = 3), we navigate from the root to the new leaf position using the key. When another leaf node is encountered (green box), we create as many branch nodes as necessary and add the two leaves as children.*

### Proof Generation and Verification

Generating a proof works in a similar way. We navigate the tree until we find a leaf or empty node. If the final node is the one we were generating a proof for, we have an inclusion proof, otherwise the queried node is not in the tree and we have a non-inclusion proof. The hashes of unvisited sibling nodes encountered in the process are given to the Verifier as part of the proof. However, we do not want to add the empty nodes, as their hash value is a protocol constant already known by the Verifier. Instead, we add a bitmap indicating which nodes in the inclusion path are non-empty.

In addition, we also support multiproofs, i.e. (non-)inclusion proofs for multiple nodes at once, which saves even more proof space. In this case the Verifier is able to recalculate some of the required hashes from the elements they already hold.

For instance, the example in Figure 6, below, depicts the multiproof for nodes A, B, C, and D (red-bordered boxes). Nodes B and C are part of the tree, while nodes A and D are not. Therefore, the Prover returns an inclusion proof for nodes B and C as well as the empty node in the path of node A and the leaf node in the path of node D. The multiproof includes all the hashes of the sibling nodes encountered in the path (red filled boxes), as well as an individual bitmap for each queried node. For example, the bitmap for node A would be 1011, where the 0 indicates the presence of an empty node in the path.

*Figure 6: An example of a multiproof. The queried keys correspond to leaf nodes A, B, C, and D (red-bordered boxes). Leaf nodes A and D are not part of the tree (the box border is dashed out). In this case, the non-inclusion proof is provided by proving the presence of an empty node or another leaf in the path to the queried node. The Prover provides the hashes of the nodes represented by filled red boxes as part of the proof. The Verifier can recompute the Merkle root using these hashes, starting from the queried nodes. Empty nodes in the path are indicated using a bitmap for each queried node.*

The verification process consists in recalculating the Merkle root starting from the queried node, using the sibling hashes and the bitmap provided in the proof.

Terminology Summary

**Cryptographic accumulator**: Quasi-commutative hash function with efficient inclusion proofs.**Dynamic accumulator**: Accumulator supporting efficient update (adding/deleting/updating key-value pairs).**Universal accumulator**: Accumulator supporting non-inclusion proofs.**Vector commitment:**Cryptographic primitive authenticating arrays.

New State Model

Now that we have covered the technical part, let us discuss how SMTs will be used in the new state model. As mentioned, SMTs are the natural choice to authenticate a key-value map, hence they are the perfect tool to authenticate the state store of a chain. We create a SMT on top of the state store, the **state tree**, and insert the Merkle root of this tree, the **state root**, into each block header. This provides several benefits:

**Interoperability:**The state root is used during the processing of cross-chain update transactions to authenticate the outgoing messages from one chain.**State consistency**: All nodes in the network will be able to check the consistency of the state just by checking the state root. For instance, if a block contains a transaction that randomly credits either a user A or a user B with tokens, then any node obtaining a different result from the block forger would immediately reject the block, as the two state roots would differ.**Light clients**: The state root can be used to provide efficient proofs for the state of a certain account. A blockchain created with the Lisk SDK defines a single global store. Each module registered in the chain defines its own generic key-value map, whereby each element is identified by the following properties:**Module ID**: The ID of the module, identifying the key-value map. We will reserve all module IDs starting with a 0 in their binary representation to modules that are shipped with the SDK by default. The IDs of modules developed as part of each blockchain application will start with a 1. The module ID has a fixed length of 4 bytes.**Store prefix**: Each module can define several**buckets**, to separate different data structures within the store. Each bucket is identified by the store prefix. The store prefix has a fixed length of 2 bytes.**Store key**: The store key identifies a certain element within a store bucket.**JSON schema**: The schema used in Lisk codec to serialize the value

*Figure 7: State tree of a Lisk blockchain. The Merkle root of the tree, the state root (blue box), authenticates the chain state. Each module defines its own store (colored boxes). SDK and application-specific modules are separated by imposing that their module IDs start with a 0 (SDK modules) or with a 1 (application modules) in their binary representation.*

Each element has a corresponding leaf node in the state tree, with the following properties:

File name`1 * **Key**: The key of a leaf node is the concatenation of the moduleID, the store prefix, and the hash of the store key. Prepending the module ID allows to separate different subtrees for each module, while the store prefix separates different bucket subtrees within a module subtree. The store key is hashed to ensure that leaf nodes are randomly distributed within a certain store bucket. Hashing the store key also fixes the length of a leaf key to be 4+2+32 = 38 bytes. As a consequence, the state tree has at most 38\*8 = 304 layers, although this is very unlikely to be the case given our optimizations. 2 * **Value**: The value is given by the hash of the element value. 3 * **Hash**: The hash equals the leaf hash of key and value.`

Hashing the store key and the values when creating a leaf node has the additional advantage of not having to reveal the real key and value of the entry in the underlying map when creating a non-inclusion proof.

*Figure 8: A zoom in on the state of the token module. User balances are stored in* *user store* *buckets, while information about native tokens sent to other chains is stored in the escrow store. The key of the corresponding leaf node is given by the concatenation of the token module ID (4 bytes = 32 layers), the user store store prefix (2 bytes = 16 layers), and the hash of the store key (32 bytes, Log(N) layers because of the optimizations). For the user store, the store key is given by the concatenation of the user address and the token identifier. The value is serialized using the* *user store schema**. For the escrow store, the store key is given by the chain ID. The value is serialized using the* *escrow store schema**.*

This new state architecture is substantially different from the one defined in LIP 0030. Previously, the state of a chain was organized *per account* rather than *per module*. For instance, the balance of a user would be stored together with all the other properties related to that specific user (see example in LIP 0030). Instead, in this new state model the balance of a user is stored in the token module state, separated from all other properties.

Separating the state store into several key-value maps allows us to logically compartmentalize each module, following the same mantra behind our chain architecture: *Every module defines its part of the state and its own state transitions*:

**Module state**: The key-value pairs stored in the map of the module. For example, the user balance and the escrow accounts stored in the token module.**Module state transitions**: The transactions defined in a module (for example, the token transfer transaction in the token module) as well as the logic executed with every block or transactions, such as the reward assigned to the forger after a block has been processed.

## What's New for Developers

In the SDK version 5, when creating a new module developers have to specify the JSON schemas used for serializing data structures with Lisk codec. In the next iteration of the SDK, developers will define the store prefix, store key, and the JSON schema for each data structure in their custom module. If you need any inspiration, please see the interoperability module LIP.

## Conclusion

LIP 0039 introduces sparse Merkle trees to the Lisk protocol. The new version of the SDK will implement sparse Merkle trees to authenticate the state of the blockchain into the state root, as specified in LIP "State model and state root". As such, sparse Merkle trees will be a cornerstone of Lisk interoperability, but in the future they could also be used in other parts of the protocol.

For further questions and comments on the topic, we will host an AMA on Lisk.chat with Alessandro Ricottone (Research Scientist), next Friday, June 25th at 4 pm CEST. We also invite all community members to go to the Lisk Research forum where we are always happy to hear your feedback and to participate in discussions on this and other topics.