Vitalik: Demystifying the advantages of "sharding" from a technical perspective
ECN Ethereum China
2021-05-19 14:58
This article is about 8,991 Chinese characters; reading the full text takes roughly 36 minutes.
Sharding is key to Ethereum's scalability future

Source | vitalik.ca

Author | Vitalik Buterin


Special thanks to Dankrad Feist and Aditya Asgaonkar for proofreading.


Sharding is the future of Ethereum scalability, and it is key to helping the ecosystem reach many thousands of transactions per second, allowing large parts of the world to regularly use the platform at an affordable cost. However, it is also one of the most misunderstood concepts in the Ethereum ecosystem, and in the broader blockchain ecosystem as well. It refers to a very specific set of ideas with very specific properties, but it often gets conflated with techniques that have much weaker and very different security properties. The purpose of this post is to explain exactly which specific properties sharding provides, how it differs from techniques that are not sharding, and what sacrifices a sharded system has to make in order to achieve these properties.


Figure: an Ethereum sharded system. Original diagram by Hsiao-wei Wang, design by Quantstamp.

The Scalability Trilemma

The best way to introduce sharding is to start with the problem that motivated the solution: the scalability trilemma.

The scalability trilemma says that there are three properties a blockchain tries to have, but, if you stick to simple techniques, you can only achieve two of the three. The three properties are:

➤ Scalability: the chain can process and verify more transactions than a single regular node (such as a consumer-grade laptop) could.

➤ Decentralization: the chain can run without any dependency on a small group of large centralized actors. This is typically interpreted to mean that there should not be any trust, or even an honest-majority assumption, in a set of nodes that you cannot join with just a consumer-grade laptop.

➤ Security: the chain can resist a large percentage of participating nodes trying to attack it. Ideally it should resist 50% of nodes; anything above 25% is acceptable, while resisting only 5% provides no meaningful security.

Below are three types of "easy solutions" that achieve only two of the three properties:

➤ Traditional blockchains, including Bitcoin, pre-PoS/sharding Ethereum, Litecoin, and other similar chains. These rely on every participant running a full node that verifies every transaction, so they have decentralization and security but not scalability.

➤ High-TPS chains, including the DPoS family but also many others. These rely on a small number of nodes, usually 10-100, to maintain consensus, and users must trust a majority of those nodes. As defined above, this is scalable and secure, but not decentralized.

➤ Multi-chain ecosystems, which refer to "scaling out" by having different applications live on different chains and communicate through cross-chain protocols. This is decentralized and scalable, but it is not secure, because an attacker only needs to control the consensus-node majority of one of the chains (often less than 1% of the whole ecosystem) to cause damage, and the damage may ripple outward as a chain reaction that severely harms applications on other chains.

Sharding is a technique that achieves all three of the above properties at once. A sharded blockchain has the following characteristics:

➤ Scalability: it can process far more transactions than a single node could.

➤ Decentralization: it can run entirely on consumer laptops, with no dependency on "supernodes" of any kind.

➤ Security: an attacker cannot target a small part of the system with a small amount of resources; they can only try to dominate and attack the whole thing.

The rest of this article discusses how sharded blockchains can achieve these advantages.

Sharding through random sampling

The easiest version of sharding to understand is sharding through random sampling. Sharding through random sampling has weaker trust properties than the form of sharding being built in the Ethereum ecosystem, but it uses simpler technology.


The core idea is as follows. Suppose you have a proof-of-stake chain with a very large number of validators (say 10,000) and a very large number of blocks that need to be verified (say 100). No single computer is powerful enough to validate all 100 blocks before the next set of blocks is produced.

To solve this problem, we randomly split up the work of verification. We randomly shuffle the validator list, assign the first 100 validators in the shuffled list to verify the first block, the next 100 validators to verify the second block, and so on. Each randomly selected group of validators assigned to verify a block (or perform some other task) is called a committee. A minimal sketch of this assignment is below.
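For concreteness, here is a minimal Python sketch of the shuffle-and-slice committee assignment described above. The function name, seed handling, and parameter values are illustrative assumptions rather than any specification; in a real protocol the randomness would come from a secure beacon such as RANDAO.

```python
# A minimal sketch (illustrative, not the beacon chain spec) of committee
# selection by randomly shuffling the validator set and slicing it up.
import random

def assign_committees(validators, num_blocks, committee_size, seed):
    """Shuffle the validator list with a shared seed and cut it into one
    committee per block."""
    rng = random.Random(seed)   # in a real protocol the seed comes from a randomness beacon
    shuffled = list(validators)
    rng.shuffle(shuffled)
    return [
        shuffled[i * committee_size:(i + 1) * committee_size]
        for i in range(num_blocks)
    ]

validators = list(range(10_000))   # 10,000 validator indices
committees = assign_committees(validators, num_blocks=100, committee_size=100, seed=42)
assert len(committees) == 100 and all(len(c) == 100 for c in committees)
```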

When a validator verifies a block, it publishes a signature attesting that it did so. Everyone else, instead of verifying 100 full blocks, now only verifies 10,000 signatures, which is a much smaller amount of work, especially with BLS signature aggregation. Instead of every block being broadcast over the same P2P network, each block is broadcast on a different subnet, and nodes only need to join the subnets for the blocks they are responsible for (or are interested in for other reasons).

Consider what happens if each node's computing power doubles. Each node can now safely verify twice as many signatures, so the minimum staking deposit can be halved, doubling the number of validators, which allows 200 committees instead of 100. Hence 200 blocks can be verified per slot instead of 100, and on top of that each block can be twice as large. The result is four times the overall chain capacity.

We can express the principle behind this in mathematical terms. Using big-O notation, we write "O(C)" for the computational capacity of a single node. A traditional blockchain can process blocks of size O(C). A sharded chain as described above can process O(C) blocks in parallel (remember, each node's indirect cost to verify each block is O(1), because each node only needs to verify a fixed number of signatures), and each block has O(C) capacity, so the sharded chain's total capacity is O(C^2). This is why this type of sharding is called quadratic sharding, and this quadratic effect is the main reason we believe that, in the long run, sharding is the best way to scale a blockchain. See the toy calculation below.
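To make the quadratic effect concrete, here is a toy calculation; the numbers are purely illustrative, not protocol parameters:

```python
# A toy calculation of why this design is called "quadratic sharding":
# total capacity grows with the square of per-node capacity.
def total_capacity(node_power):
    committees = node_power          # a node can follow O(C) committees' signatures
    block_size = node_power          # each committee can verify a block of size O(C)
    return committees * block_size   # O(C^2) total transactions per slot

for c in (100, 200, 400):
    print(f"per-node capacity {c} -> total chain capacity {total_capacity(c)}")
# Doubling per-node capacity quadruples total chain capacity.
```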

People often ask: "How is randomly forming 100 committees different from splitting into 100 separate blockchains?"
The difference is mainly in the following two aspects:

1. Random sampling prevents the attacker from concentrating their power on one shard. In a multi-chain ecosystem of 100 chains, the attacker can wreak havoc with only about 0.5% of the total stake: they can focus on 51%-attacking a single chain. In a sharded chain, the attacker must have close to 30-40% of the entire stake to do the same thing; in other words, the shards share the chain's security. Of course, an attacker with less than that can wait around and hope to get lucky and randomly end up with 51% of a single committee despite holding far less than 51% overall, but the probability of this drops off exponentially as the attacker's share falls further below 51%; with under roughly 30% of the stake, it is essentially impossible (a rough probability sketch follows after this list).

2. Tight coupling: if even one shard gets a bad block, the entire chain reorgs to avoid accepting it. There is a social contract (and, in later sections of this post, some ways of technologically enforcing it) that a chain with even one bad block in even one shard is not acceptable and must be thrown out as soon as the bad block is discovered. Thanks to this mechanism, from the point of view of an application, the sharded chain has perfect security: contract A can rely on contract B, because if contract B misbehaves due to an attack on the chain, that entire history reverts, including the transactions in contract A that acted on the basis of contract B's malfunction.
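To illustrate point 1, here is a rough sketch of the committee-capture probability. The committee size and the simple majority threshold are illustrative assumptions, not Ethereum's actual parameters, and the binomial model ignores sampling without replacement from the validator set:

```python
# A rough sketch of how unlikely it is for an attacker with well under 50% of
# the total stake to capture a randomly sampled committee.
from math import comb

def p_committee_majority(attacker_share, committee_size):
    """Probability that attacker-controlled validators form a strict majority
    of one randomly sampled committee (binomial approximation)."""
    k_min = committee_size // 2 + 1
    return sum(
        comb(committee_size, k)
        * attacker_share ** k
        * (1 - attacker_share) ** (committee_size - k)
        for k in range(k_min, committee_size + 1)
    )

for share in (0.05, 0.20, 0.30, 0.40, 0.49):
    p = p_committee_majority(share, committee_size=100)
    print(f"attacker stake {share:.0%}: chance of capturing one committee ~ {p:.2e}")
```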

These two differences ensure that sharding creates an environment for applications that preserves the key security properties of a single-chain environment, in a way that multi-chain ecosystems fundamentally cannot.

Improving sharding through a better security model

The main weakness of more centralized high-TPS chains is that they lack this extra layer of security. Such chains do not, and cannot, have a culture of regular users running nodes, so it is easier for the major nodes and ecosystem players to get together and impose a protocol change that the community strongly dislikes. Worse still, users' nodes accept the change by default. After a while users will notice, but by then the change is a fait accompli: the main coordination burden of rejecting it falls on the users, who would have to make the painful decision of rolling back a day or more of activity that everyone else had thought was already finalized.

Ideally, we want a form of sharding that avoids 51% trust assumptions for validity and preserves the powerful wall of security that traditional blockchains get from full verification. And this is exactly what much of our research over the past few years has been about.

Scalable verification of computation

We can split the problem of scalable validation that resists 51% attacks into two cases:

➤ Verifying computation: checking that some computation was done correctly, assuming you have possession of all the inputs to the computation.

➤ Verifying data availability: checking that the inputs to the computation themselves are stored in some form from which you can download them if you really need to; this check should be performed without actually downloading all of the inputs, because the data may be too large to download in full.

Validating a block in a blockchain involves both computation and data availability checking: you need to be convinced that the transactions in the block are valid and that the new state root hash claimed in the block is the correct result of executing those transactions, but you also need to be convinced that enough data from the block was actually published so that users who download that data can compute the state and continue processing the chain. This second part concerns a very subtle but very important concept called the data availability problem, discussed below.

Scalable verification of computation is relatively easy to achieve; there are two families of techniques: fraud proofs and ZK-SNARKs.

Fraud proofs can verify calculations while ensuring scalability

The following is a brief introduction to the two types of technologies:

➤ A fraud proof system is one where, to accept the result of a computation, you require someone with a staked deposit to sign a message of the form "I certify that if you perform computation C with input X, you get output Y". You trust this message by default, but other participants with staked deposits have the chance to challenge the result by signing a message saying "I disagree, the output is Z, not Y". Only when a challenge is issued do all nodes run the computation. Whichever of the two parties was wrong loses their deposit, and all computations that depended on the incorrect result are redone.

Computation based on fraud proofs is scalable because in the "normal case" you replace running a complex computation with verifying a single signature. There is the exceptional case where you do have to verify the computation on-chain because of a challenge, but the exceptional case is rare, because triggering it is very expensive: either the original claimer or the challenger loses a large deposit. ZK-SNARKs are conceptually simpler, since they simply replace the computation with a much cheaper proof verification, but the mathematics behind how they work is considerably more complex. A simplified sketch of the fraud-proof flow is below.
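Here is a highly simplified sketch of the optimistic claim-and-challenge flow described above. The class and function names, and the stand-in compute function, are made up for illustration; a real system re-executes the disputed computation on-chain or inside a verification game.

```python
# A highly simplified sketch of the claim-and-challenge flow described above.
def compute(x):
    return x * 2                      # stand-in for an arbitrary deterministic computation

class Claim:
    def __init__(self, claimer, x, claimed_y, deposit):
        self.claimer, self.x, self.claimed_y, self.deposit = claimer, x, claimed_y, deposit

def resolve_challenge(claim, challenger, challenger_y):
    """Only when a challenge is issued does anyone re-run the computation;
    whoever turns out to be wrong loses their deposit."""
    true_y = compute(claim.x)         # the rare, expensive full re-execution
    if claim.claimed_y == true_y:
        return f"challenger {challenger} slashed"
    if challenger_y == true_y:
        return f"claimer {claim.claimer} slashed; dependent computations redone"
    return "both parties wrong; both slashed"

claim = Claim("alice", x=21, claimed_y=999, deposit=32)   # an incorrect claim
print(resolve_challenge(claim, "bob", challenger_y=42))   # -> claimer alice slashed
```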

There is also an important hybrid, "semi-scalable" system that verifies computation scalably but still requires every node to verify all of the data. By using compression tricks to replace most of the data with computation, the amount of data to verify can be greatly reduced. This is the approach taken by rollups.

Scalable verification of data availability is harder

Fraud proofs cannot be used to verify data availability. A fraud proof for a computation relies on the fact that the inputs to the computation were published on-chain the moment the original claim was submitted, so that if someone challenges, the challenge runs in exactly the same "environment" as the original execution. For a data availability check this does not work, because the problem is precisely that there is too much data to publish and check on-chain. Hence, the key difficulty is that no fraud-proof-like scheme can work for data availability: someone could claim "data X is available" without publishing it, wait to be challenged, and only then publish X, making the challenger appear incorrect to the rest of the network.

The "fisherman's dilemma" in the figure below can explain the reason very well:

The core of the fisherman's dilemma is that there are two scenarios: one in which V1 is a malicious publisher and V2 is an honest challenger, and one in which V1 is an honest publisher and V2 is a malicious challenger. Anyone who was not personally trying to download that particular piece of data at the time cannot tell the two scenarios apart. And in a scalable decentralized blockchain, each individual node can only be expected to download a small fraction of the data, so only a small number of nodes would see anything about what happened other than the bare fact that there was a disagreement.

Because it is impossible to tell which side is correct, it is impossible to build a working fraud-proof scheme for data availability.

People often ask: "What if some data is not available? ZK-SNARK can ensure the validity of everything, but is that not enough?"

Unfortunately, validity alone is not sufficient to keep a blockchain running. The reason is that if the chain is valid but all of the data is unavailable, then users have no way of updating the data they need to generate the proofs that let them verify future blocks. An attacker who produces a block that is valid but whose data is unavailable can therefore effectively stall the chain. An attacker could also withhold a specific user's account data until that user pays a ransom, so the problem is not merely a liveness issue.

There are some strong information-theoretic arguments that this problem is fundamental and that there is no clever trick (such as cryptographic accumulators) that gets around it. See this paper for details.

So, how do you check if 1 MB of data is available without downloading it? This sounds impossible!

The key solution is a technique called data availability sampling. The technique works as follows:

1. Use an erasure code to expand the data from N chunks into 2N chunks, in such a way that any N of those 2N chunks can reconstruct the entire data (a minimal sketch of this encoding step follows after this list).

2. To check availability, instead of downloading all of the data, users simply pick a constant number of random positions in the block (say 30), and accept the block only when they successfully find the chunks at all of their chosen positions.
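Here is a minimal sketch of the erasure-coding step: treat the N chunks as coefficients of a polynomial over a prime field, publish its evaluations at 2N points, and recover everything from any N of those evaluations. The field size and chunk values are illustrative choices; real systems use Reed-Solomon codes over much larger fields.

```python
# Minimal erasure-coding sketch: any N of the 2N extended chunks reconstruct the data.
P = 2**31 - 1   # a small prime modulus, for illustration only

def eval_poly(coeffs, x):
    """Evaluate the polynomial with the given coefficients at x, mod P."""
    acc = 0
    for c in reversed(coeffs):
        acc = (acc * x + c) % P
    return acc

def lagrange_eval(points, x):
    """Value at x of the unique polynomial of degree < len(points) passing
    through the given (xi, yi) points, mod P."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        total = (total + yi * num * pow(den, P - 2, P)) % P   # Fermat inverse
    return total

data = [17, 42, 99, 7]                                  # N = 4 original chunks
N = len(data)
extended = [eval_poly(data, x) for x in range(2 * N)]   # 2N extended chunks

# Pretend only 4 of the 8 extended chunks survived: any N chunks suffice.
available = [(x, extended[x]) for x in (1, 3, 4, 6)]
recovered = [lagrange_eval(available, x) for x in range(2 * N)]
assert recovered == extended
```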

Erasure coding transforms the problem from "checking 100% availability" (every single piece of data is available) into "checking 50% availability" (at least half of the pieces are available). Random sampling then solves the 50% problem: if less than 50% of the data is available, at least one of the checks will almost certainly fail; and if at least 50% of the data is available, then, although some nodes may fail to recognize a block as available, it only takes a single honest node to run the erasure-code reconstruction procedure and bring back the remaining 50% of the block. So to check the availability of a 1 MB block, you do not need to download 1 MB of data, only a few kilobytes. This makes it feasible to run data availability checks on every block. See this post on how this checking can be done efficiently with P2P subnets.
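A back-of-the-envelope sketch of why a handful of samples is enough (the sample count of 30 follows the example above and is illustrative):

```python
# If fewer than 50% of the 2N extended chunks are available, the block cannot
# be reconstructed, and each random sample then succeeds with probability < 1/2.
samples = 30
p_fooled = 0.5 ** samples   # chance that every sample happens to hit an available chunk
print(f"{samples} samples -> chance of accepting an unavailable block < {p_fooled:.1e}")
# prints: 30 samples -> chance of accepting an unavailable block < 9.3e-10
```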

A ZK-SNARK can be used to verify that the erasure coding of the data was done correctly, and Merkle branches can then be used to verify individual chunks. Alternatively, one can use polynomial commitments (for example Kate, also known as KZG, commitments), which essentially do the erasure coding, the proofs of individual elements, and the correctness verification all in one simple component; this is what Ethereum sharding uses.
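For the first option, here is a minimal sketch of checking a single chunk against a Merkle root; it assumes a power-of-two number of chunks and uses SHA-256 purely for illustration.

```python
# Verifying that one erasure-coded chunk belongs to a committed block root.
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(leaves):
    layer = [h(x) for x in leaves]
    while len(layer) > 1:
        layer = [h(layer[i] + layer[i + 1]) for i in range(0, len(layer), 2)]
    return layer[0]

def merkle_branch(leaves, index):
    """Sibling hashes needed to recompute the root from one leaf."""
    layer, branch = [h(x) for x in leaves], []
    while len(layer) > 1:
        branch.append(layer[index ^ 1])
        layer = [h(layer[i] + layer[i + 1]) for i in range(0, len(layer), 2)]
        index //= 2
    return branch

def verify_branch(leaf, index, branch, root):
    node = h(leaf)
    for sibling in branch:
        node = h(node + sibling) if index % 2 == 0 else h(sibling + node)
        index //= 2
    return node == root

chunks = [bytes([i]) * 32 for i in range(8)]       # 8 chunks of an extended block
root = merkle_root(chunks)                         # published as the block's data commitment
proof = merkle_branch(chunks, 5)
assert verify_branch(chunks[5], 5, proof, root)    # chunk 5 really belongs to the block
```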

Summary: how do we check the availability and correctness of all the data?

Suppose again that you have 100 blocks and you want to efficiently verify the correctness of all of them without relying on committees. Here is what we need to do:

➤ Each client performs data availability sampling on every block, verifying that the block's data is available while downloading only a few kilobytes per block, even if the block as a whole is a megabyte or larger. A client only accepts a block once all of its data availability sampling queries have been answered correctly.

➤ Now that data availability has been verified, it becomes easier to verify correctness. Correctness can be verified with one of two techniques:

We can use fraud proofs: a few participants with staked deposits sign off on each block's correctness. Other nodes, called challengers or fishermen, randomly check blocks and attempt to process them in full. Because data availability has already been checked, it is always possible to download the data and fully process any particular block. If a node finds an invalid block, it publishes a challenge that everyone can verify. If the block turns out to be bad, then that block and all blocks that build on it need to be recomputed.

Alternatively, we can use ZK-SNARKs, in which case every block comes with a ZK-SNARK proving its correctness.

➤ In both cases, each client only needs to do a small amount of verification work per block, no matter how large the block is. In the fraud-proof case, blocks occasionally need to be fully verified on-chain, but this is rare because even triggering a single challenge is very expensive for whichever party turns out to be wrong.

And that is all there is to it! In the case of Ethereum sharding, the near-term plan is to make sharded blocks data-only; that is, the shards are purely a "data availability engine", and it is the job of layer-2 rollups to use that secure data space, plus either fraud proofs or ZK-SNARKs, to implement high-throughput secure transaction processing. However, it is entirely possible to build such a system for high-throughput execution "natively" in-protocol as well.

What are the key properties of a sharding system? What are the tradeoffs?

The main goal of sharding is to inherit as much as possible the most important security properties of traditional non-sharded blockchains, while eliminating the need for every node to verify every transaction.

Sharding gets most of the way there. First, the properties of a traditional (non-sharded) blockchain:

➤ An invalid block cannot be added to the blockchain because validating nodes will detect that the block is invalid and ignore it.

➤ A block whose data is unavailable cannot be added to the chain, because validating nodes fail to download the data and therefore ignore the block.

The following are the properties of a sharded blockchain with advanced security features:

➤ Invalid blocks cannot be added to the chain, because either:

a fraud proof quickly catches the block, informs the entire network that it is incorrect, and heavily penalizes its creator; or

validity is proven with a ZK-SNARK, and a valid ZK-SNARK cannot be generated for an invalid block.

➤ Blocks whose data is unavailable cannot be added to the chain, because:

if less than 50% of a block's data is available, at least one data availability sampling check will almost certainly fail for each client, causing the client to reject the block;

if at least 50% of a block's data is available, then in fact the entire block is available, because it only takes a single honest node to run the erasure-code reconstruction procedure and make the rest of the block available.

Traditional high-TPS chains without sharding have no way of providing these guarantees. Multi-chain ecosystems cannot avoid the problem of an attacker picking one chain and easily taking it over; the chains could share security, but if this is done badly the result is effectively a traditional high-TPS chain with all of its drawbacks, and if it is done well it is simply a more complicated implementation of the sharding techniques above.

Sidechains are highly implementation-dependent: they are typically vulnerable either to the weaknesses of traditional high-TPS chains (if they share miners or validators) or to the weaknesses of multi-chain ecosystems (if they do not). Sharded chains avoid both sets of problems.

However, there are some cracks in the armor of sharded systems. In particular:

➤ Sharded chains that rely only on committees are vulnerable to adaptive adversaries and have weaker accountability. That is, if an adversary can hack into, or simply shut down, any set of nodes of their choosing in real time, then they only need to attack a small number of nodes to break a single committee. Furthermore, whether the adversary is adaptive or simply holds 50% of the total stake, if they do break a single committee, only the small set of nodes that participated in the attack, namely the nodes in that committee, can be publicly identified, so only a small fraction of the stake can be penalized. This is another key reason why data availability sampling, combined with fraud proofs or ZK-SNARKs, is an important complement to random sampling techniques.

➤ Data availability sampling is only robust if there are enough online clients that, collectively, their sampling requests almost always overlap to cover at least 50% of each block's data. In practice this means that a few hundred clients must be online, and this number grows as the ratio of system capacity to single-node capacity grows (a rough model of this requirement follows after this list). This is a few-of-N trust model: generally quite trustworthy, though certainly not as robust as the 0-of-N trust that nodes in non-sharded chains have for data availability.

➤ If the sharded chain relies on fraud proofs, it depends on timing assumptions: if the network is too slow, nodes might accept a block as finalized before the fraud proof showing it is invalid arrives. Fortunately, if you follow the strict rule of reverting all invalid blocks as soon as an invalid block is discovered, this becomes a threshold that each user sets for themselves: more cautious users who wait longer before treating blocks as final are safer, while users who wait less take on more risk. Even so, this weakens the user experience. Using ZK-SNARKs to verify validity removes the problem.

➤ A much larger amount of raw data has to be passed around, increasing the risk of failure under extreme network conditions. Small amounts of data are easier to transmit than large amounts, and easier to safely hide if a powerful government attempts to censor the chain. Block explorers that want to hold the information of the entire chain also need to store more data.
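As a rough illustration of the "enough clients" point above, here is a back-of-the-envelope model; the chunk counts and the samples-per-client figure are my own illustrative assumptions, not protocol constants.

```python
# The union of all clients' random samples must cover at least half of the
# extended chunks so that honest nodes can reconstruct the data.
from math import ceil, log

def clients_needed(total_chunks, samples_per_client=30):
    """Approximate number of clients so the expected sample coverage is >= 50%.
    P(a chunk is never sampled) = (1 - 1/total_chunks)^(clients * samples),
    which drops below 1/2 once clients * samples >= total_chunks * ln(2)."""
    return ceil(total_chunks * log(2) / samples_per_client)

# e.g. 64 shards, each block extended to 512 chunks (purely illustrative sizes)
print(clients_needed(64 * 512))   # on the order of several hundred clients
```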

These are legitimate concerns, although in our view they are far outweighed by the reduction in user-level centralization that comes from enabling more applications to run on-chain rather than through centralized layer-2 services. That said, in practice these issues, especially the last two, place real limits on pushing a sharded chain's throughput beyond a certain point. There is a limit to the quadraticity of quadratic sharding.

Incidentally, the growing safety risks of sharded chains at very high throughput are also the main reason why efforts to extend to super-quadratic sharding have largely been abandoned; keeping quadratic sharding merely quadratic looks like the happy medium.

Why not centralized block production with sharded verification?

A frequently proposed alternative to sharding is to build a chain structured like a centralized high-TPS chain, but with data availability sampling and sharding layered on top to verify validity and availability.

This improves on today's centralized high-TPS chains, but it is still considerably weaker than a sharded system, for several reasons:

1. It is much harder to detect censorship by block producers in a high-TPS chain.

Censorship detection requires either (i) being able to see every transaction and verify that no legitimate transactions are somehow being kept out, or (ii) a 1-of-N trust model among block producers, with verification that no blocks fail to get in. In a centralized high-TPS chain, (i) is impossible, and (ii) is harder because the small node count makes even a 1-of-N trust model easier to break. And if the chain's block time is too fast for data availability sampling (DAS), as is true of most centralized high-TPS chains, it is very hard to prove that a node's blocks are not being rejected simply because they are being published too slowly.

2. If a majority of block producers and ecosystem members try to push through an unpopular protocol change, users' clients will certainly detect it, but it is much harder for the community to reject the change and fork away, because they would need to spin up a new set of very expensive high-throughput nodes to keep a chain with the old rules alive.

3. Censorship by external attackers is easier against centralized infrastructure. The block production nodes have very high transaction throughput, so they are easy to detect and easy to shut down. It is also politically and logistically easier to censor dedicated high-performance computing facilities than to go after individual users' laptops.

4. There is stronger pressure for high-performance block production to move into centralized cloud services, increasing the risk that the entire chain ends up running inside the cloud services of one to three companies, and hence the risk that the chain goes down because many block producers fail at the same time. A sharded chain whose validators all run on their own hardware is much less vulnerable to this.

A properly sharded system is better suited as a base layer. Given a sharded base layer, you can always build a system with centralized production on top of it as a rollup (for example, a high-throughput domain with synchronous composability for DeFi). But if the base layer relies on centralized block production, you cannot build a more decentralized layer 2 on top of it.
