
Original title: Checks and balances: Machine learning and zero-knowledge proofs
Original Author: Elena Burger, a16z
Compiled by: The Way of DeFi
Over the past few years, zero-knowledge proofs on blockchains have been used for two key purposes: (1) scaling computation-constrained networks by processing transactions off-chain and verifying the results on mainnet; and (2) protecting user privacy by enabling shielded transactions that can only be viewed by those who hold the decryption key. These properties are clearly desirable in the context of blockchains: a decentralized network like Ethereum cannot increase throughput or block size without imposing unacceptable demands on validators' processing power, bandwidth, and latency (hence the need for validity rollups), and all transactions are visible to anyone (hence the need for on-chain privacy solutions).
But zero-knowledge proofs are also useful for a third class of capability: efficiently verifying that any kind of computation (not just computation in an off-chain instantiation of the EVM) ran correctly. This has big implications for fields well beyond blockchains.
Now, advances in systems that leverage zero-knowledge proofs to succinctly verify arbitrary computation allow users to demand the same degree of trustlessness and verifiability that blockchains guarantee from every digital product, especially from machine learning models. The high demand for blockchain compute has spurred zero-knowledge proof research, producing modern proof systems with smaller memory footprints and faster proving and verification times, making it possible today to verify certain small machine learning algorithms on-chain.
By now, we've all probably experienced the potential of interacting with a very powerful machine learning product. A few days ago, I used GPT-4 to help me create an AI that consistently beats me at chess. It felt like a poetic microcosm of all the advances machine learning has made over the past few decades: it took IBM developers twelve years to build Deep Blue, a model running on a 32-node IBM RS/6000 SP computer capable of evaluating nearly 200 million chess moves per second, which beat chess champion Garry Kasparov in 1997. By contrast, it took me a few hours -- with minimal coding on my part -- to create a program that could beat me.
Admittedly, I doubt the AI I created could beat Garry Kasparov at chess, but that's beside the point. The point is that anyone who has played around with GPT-4 has probably had a similar feeling of gaining superpowers: with very little effort, you can create something that approaches or surpasses your own abilities. We are all IBM researchers; we are all Garry Kasparov.
Obviously, it's exciting and a little daunting. For anyone working in the crypto industry, the natural reaction (after marveling at what machine learning can do) is to think about the potential vectors of centralization, and how those vectors can be decentralized into networks that people can transparently audit and own. Today's models are produced by ingesting vast amounts of publicly available text and data, yet only a small number of people currently control and own those models. More specifically, the question is not "whether artificial intelligence is of great value" but "how do we build these systems so that anyone who interacts with them can capture their economic benefits and, if they wish, ensure that their data is used in a way that respects their right to privacy".
Recently, there have been calls to pause or slow the development of major AI projects like ChatGPT. Halting progress is probably not the answer: a better approach would be to push for open-source models and, in cases where a model provider wants to keep its weights or data private, to secure them with privacy-preserving zero-knowledge proofs that live on-chain and can be fully audited. The latter use case, around private model weights and data, is not yet feasible on-chain today, but advances in zero-knowledge proof systems will make it possible in the future.
Verifiable and Ownable Machine Learning
The chess AI I built using ChatGPT appears relatively harmless so far: a program with fairly uniform outputs that doesn't use data that infringes valuable intellectual property or violates anyone's privacy. But what about when we want assurance that the model we're told is running behind an API is actually the one that was run? Or what if I wanted to feed authenticated data into a model on-chain and be sure the data really came from a legitimate party? And what if I want to make sure the "person" submitting data is actually a human, and not a bot trying to mount a Byzantine attack on my network? Zero-knowledge proofs, with their ability to succinctly represent and verify arbitrary programs, offer a solution.
It should be noted that, for now, the main use of zero-knowledge proofs in the context of on-chain machine learning is to verify correct computation. In other words, zero-knowledge proofs, and more specifically SNARKs (succinct non-interactive arguments of knowledge), are most useful in the machine learning context for their succinctness properties. That's because zero-knowledge proofs protect the privacy of the prover (and of the data it processes) from a prying verifier. Privacy-enhancing technologies such as fully homomorphic encryption (FHE), functional encryption, or trusted execution environments (TEEs) are better suited to letting an untrusted prover run computations over private input data (exploring these techniques in more depth is beyond the scope of this article).
Let's take a step back and look at a high level at the types of machine learning applications that can be represented in zero knowledge (for a deeper look at zero knowledge, see our article on improvements to zero-knowledge proof algorithms and hardware, check out Justin Thaler's research on SNARK performance, or see our zero-knowledge textbook). Zero-knowledge proofs typically represent programs as arithmetic circuits: using these circuits, the prover generates a proof from public and private inputs, and the verifier performs a mathematical computation to confirm that the statement's output is correct, without learning anything about the private inputs.
We are still in the very early days of verifying computation with on-chain zero-knowledge proofs, but algorithmic improvements are expanding what is feasible. Here are five ways zero-knowledge proofs can be applied to machine learning.
1. Model authenticity:
You want assurance that the machine learning model an entity claims to have run is indeed the one that was run. One example is a case where a model sits behind an API and the provider offers multiple versions of a particular model, say a cheaper, less accurate one and a more expensive, higher-performance one. Without proofs, you have no way of knowing whether the provider served you the cheaper model when you actually paid for the more expensive version (for example, because the provider wants to save on server costs and boost its profit margin).
To do this, you would need a separate proof for each instantiation of a model. One practical way to achieve that is through Dan Boneh, Wilson Nguyen, and Alex Ozdemir's functional commitment framework, a SNARK-based zero-knowledge commitment scheme that lets a model owner commit to a model, which users can then feed their data into and receive verification that the committed model is the one that ran. Some applications built on Risc Zero (a general-purpose STARK-based virtual machine) also enable this. Other research by Daniel Kang, Tatsunori Hashimoto, Ion Stoica, and Yi Sun has shown that it is possible to verify valid inference on the ImageNet dataset at 92% accuracy (comparable to the highest-performing non-zero-knowledge-verified ImageNet models).
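To make the commitment step concrete, here is a minimal sketch in plain Python (not tied to any of the frameworks above) of how a model owner might publish a binding hash commitment to a set of weights. The toy "models" and file handling are hypothetical; the functional-commitment schemes referenced here additionally produce a SNARK proving that the committed model was actually executed on a user's input, which this sketch does not do.

```python
import hashlib
import json

def commit_to_model(weights: dict) -> str:
    """Hash a deterministic serialization of the model weights.
    This is only the 'binding' half of a commitment scheme: anyone can later
    check that revealed weights match the published digest, but it does not
    hide the weights or prove anything about inference."""
    serialized = json.dumps(weights, sort_keys=True).encode("utf-8")
    return hashlib.sha256(serialized).hexdigest()

# Hypothetical toy "models": a single linear layer's weights and bias.
weights_expensive = {"w": [0.12, -0.53, 0.98], "b": 0.07}
weights_cheap = {"w": [0.10, -0.50, 1.00], "b": 0.00}

commitment = commit_to_model(weights_expensive)  # provider publishes this up front

# Later, an auditor who obtains the weights the provider actually served
# can check them against the published commitment.
assert commit_to_model(weights_expensive) == commitment
assert commit_to_model(weights_cheap) != commitment
```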
But simply receiving proof that the committed model was run is not necessarily enough. A model may not accurately represent a given program, so one would want the committed model to be audited by a third party. Functional commitments allow the prover to establish that it used a committed model, but they don't guarantee anything about the model that was committed. If we can make zero-knowledge proofs performant enough for proving training (see example #4 below), we may one day start to get those guarantees as well.
2. Model integrity:
You want assurance that the same machine learning algorithm is being run on different users' data in the same way. This is useful in areas where you don't want arbitrary bias applied, such as credit scoring decisions and loan applications. Functional commitments work here too: you commit to a model and its parameters and allow people to submit data. The outputs would verify that the model was run with the committed parameters on each user's data. Alternatively, the model and its parameters could be made public, and users could prove for themselves that they applied the appropriate model and parameters to their own (authenticated) data. This could be especially useful in medicine, where the law requires certain patient information to remain private. In the future, this could enable medical diagnostic systems that learn and improve from real-time user data while keeping it completely private.
3. Attestations:
You want to incorporate attestations from externally verified parties (e.g. any digital platform or piece of hardware that can produce a digital signature) into a model, or into any other kind of smart contract, running on-chain. To do this, you would verify the signature using a zero-knowledge proof and provide that proof as an input to the program. Anna Rose and Tarun Chitra recently hosted an episode of the Zero Knowledge podcast with Daniel Kang and Yi Sun discussing the latest developments in this area.
Specifically, Daniel and Yi recently released research on ways to verify that images taken by cameras with attested sensors were subjected to transformations such as cropping, which is useful in cases where an image has not been deepfaked but has undergone some legitimate form of editing. Dan Boneh and Trisha Datta have done similar work using zero-knowledge proofs to verify the provenance of images.
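As a rough illustration of the attestation primitive involved (run in the clear, outside of any circuit), the sketch below signs an image's bytes with an Ed25519 key standing in for a camera's attested sensor key, then verifies the signature with the corresponding public key. In the research described above, a check of this kind (plus the permitted image transformations) is proven inside a zero-knowledge circuit rather than run directly; the key handling and payload here are hypothetical.

```python
# Minimal sketch using the `cryptography` package (pip install cryptography).
# The signing key stands in for a camera's attested sensor key; in the ZK
# setting, this signature verification happens inside the circuit.
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey
from cryptography.exceptions import InvalidSignature

sensor_key = Ed25519PrivateKey.generate()          # held by the camera/sensor
sensor_pubkey = sensor_key.public_key()            # published / certified

image_bytes = b"...raw image bytes from the sensor..."  # placeholder payload
signature = sensor_key.sign(image_bytes)           # attestation over the image

# A verifier (or, in the ZK setting, a circuit) checks the attestation.
try:
    sensor_pubkey.verify(signature, image_bytes)
    print("image is attested by the sensor key")
except InvalidSignature:
    print("attestation check failed")
```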
But more broadly, any digitally authenticated message is a candidate for this form of verification: Jason Morton, who is developing the EZKL library (more on that in the next section), calls this approach "bringing vision to the blockchain." Any signed endpoint (e.g. Cloudflare's SXG service, third-party notaries) produces digital signatures that can be verified, which can be useful for proving provenance and authenticity from a trusted party.
4. Decentralized inference or training:
You want to perform machine learning inference or training in a distributed way and allow people to submit data to a public model. To do this, you might deploy an existing model on-chain, or architect an entirely new network, and use zero-knowledge proofs to compress the model. Jason Morton's EZKL library is creating a method for ingesting ONNX and JSON files and converting them into ZK-SNARK circuits. A recent demo at ETH Denver showed that this can be used to build on-chain, image-recognition-based scavenger hunts, where game creators upload a photo and generate a proof of the image, and players then upload images of their own; the verifier checks whether a user-submitted image is a close enough match to the creator's proof. EZKL can now verify models of up to 100 million parameters, which means it could in principle be used to verify ImageNet-sized models (which have 60 million parameters) on-chain.
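As a rough sketch of what feeding a model into an ONNX-based pipeline like EZKL's involves, the snippet below exports a tiny PyTorch model to ONNX and writes a JSON file containing a sample input. The model, file names, and JSON layout are illustrative assumptions rather than EZKL's exact expected schema, and the subsequent circuit generation, proving, and verification steps are left to the library's own tooling.

```python
# Illustrative only: produce an ONNX graph plus a JSON input file, the two
# artifacts an ONNX-to-circuit pipeline such as EZKL ingests. File names and
# the JSON layout are assumptions, not EZKL's exact schema.
import json
import torch
import torch.nn as nn

class TinyClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))

    def forward(self, x):
        return self.net(x)

model = TinyClassifier().eval()
sample = torch.randn(1, 4)

# Standard PyTorch export of the model's computation graph to ONNX.
torch.onnx.export(model, sample, "tiny_classifier.onnx",
                  input_names=["input"], output_names=["output"])

# A JSON file holding the input the proof would be generated over.
with open("input.json", "w") as f:
    json.dump({"input_data": sample.flatten().tolist()}, f)
```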
Other teams, like Modulus Labs, are benchmarking different proof systems for on-chain inference. Modulus' benchmarks went up to 18 million parameters. On the training side, Gensyn is building a distributed compute system where users can submit public data and have models trained by a distributed network of nodes, with the correctness of that training verified.
5. Proof of personhood:
You want to verify that someone is a unique individual without compromising their privacy. To do this, you would create a method of verification, such as a biometric scan or a way to submit a government ID in encrypted form. You would then use zero-knowledge proofs to check that someone has been verified, without revealing any information about that person's identity, whether the identity is fully identifying or pseudonymous, like a public key.
Worldcoin does this through its proof-of-personhood protocol, which ensures Sybil resistance by generating unique iris codes for users. Crucially, the private keys created for a WorldID (and the other private keys of the crypto wallets created for Worldcoin users) are kept entirely separate from the iris codes generated locally by the project's eye scanners. This separation fully decouples the biometric identifier from any form of user key that could be attributed to a person. Worldcoin also lets applications embed an SDK that allows users to log in with their WorldID, and it uses zero-knowledge proofs to preserve privacy: an application can check that a person has a WorldID, but it cannot track individual users (see this blog post for more details).
This example is a form of using the privacy-preserving properties of zero-knowledge proofs to combat weak and malicious AI (for instance, proving that you are a real human and not a bot without revealing anything else about yourself), so it is quite different from the other examples above.
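To illustrate one way the "check membership without enabling tracking" property can work, here is a simplified sketch inspired by Semaphore-style nullifiers, not Worldcoin's actual implementation: each application sees only an app-scoped identifier derived from the user's secret, so the same person produces different, unlinkable identifiers in different applications.

```python
# Simplified sketch of app-scoped nullifiers (inspired by Semaphore-style
# designs, NOT Worldcoin's actual protocol). In a real system the nullifier
# is produced inside a zero-knowledge proof so the secret is never revealed.
import hashlib
import secrets

def nullifier(identity_secret: bytes, app_id: str) -> str:
    """Derive an identifier that is stable per (user, app) but unlinkable across apps."""
    return hashlib.sha256(identity_secret + app_id.encode()).hexdigest()

alice_secret = secrets.token_bytes(32)   # held only by the user

# The same user looks like a different, stable pseudonym to each application,
# letting an app enforce "one action per person" without cross-app tracking.
login_app_a = nullifier(alice_secret, "app-a")
login_app_b = nullifier(alice_secret, "app-b")

assert login_app_a == nullifier(alice_secret, "app-a")  # stable within an app
assert login_app_a != login_app_b                       # unlinkable across apps
```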
Model Architecture and Challenges
Breakthroughs in the proof systems that implement SNARKs (succinct non-interactive arguments of knowledge) have been a key driver in bringing many machine learning models on-chain. Some teams are building custom circuits within existing architectures (including Plonk, Plonky2, AIR, and others). On the custom-circuit side, Halo2 has become a widely used backend, both in the work of Daniel Kang et al. and in Jason Morton's EZKL project. Halo2's prover time is quasilinear, proofs are typically just a few kilobytes, and verifier time is constant. Perhaps more importantly, Halo2 has strong developer tooling, which has made it a favorite SNARK backend among developers. Other teams, like Risc Zero, are pursuing a general-purpose VM strategy. Still others are building custom frameworks using Justin Thaler's highly efficient proof systems based on the sum-check protocol.
Proof generation and verification times of course depend on the hardware generating and checking the proofs, as well as on the size of the circuit used for proof generation. But the key point here is that regardless of the program being represented, the proof will always be relatively small, so the burden on the verifier checking the proof is bounded. There are, however, some subtleties: for proof systems like Plonky2 that use a FRI-based commitment scheme, proof size may increase (unless the proof is wrapped at the end in a pairing-based SNARK like Plonk or Groth16, whose proofs do not grow with the complexity of the statement being proven).
The implication for machine learning models is that once a proof system accurately representing a model has been designed, the cost of actually verifying its outputs is very cheap. The considerations developers must weigh most carefully are prover time and memory: representing the model in a way that can be proven relatively quickly, with proof sizes ideally on the order of a few kilobytes. To prove, in zero knowledge, that a machine learning model was executed correctly, you need to encode the model's architecture (layers, nodes, and activation functions), parameters, constraints, and matrix multiplication operations and represent them as circuits. This involves decomposing these properties into arithmetic operations that can be performed over a finite field.
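As a toy illustration of what that decomposition means (a minimal sketch, not any particular proof system), the snippet below evaluates a single quantized neuron as integer arithmetic modulo a prime, which is the kind of statement an arithmetic circuit's constraints encode.

```python
# Toy illustration of arithmetizing one neuron: all values are integers in a
# prime field, and the "constraint" is output == sum(w_i * x_i) + b (mod p).
# This mirrors what a circuit encodes; no actual proof system is involved.
P = 2**61 - 1  # a prime modulus standing in for the proof system's field

def field(x: int) -> int:
    return x % P

def neuron_constraint(weights, inputs, bias, claimed_output) -> bool:
    """Check the single arithmetic constraint a circuit would enforce."""
    acc = field(bias)
    for w, x in zip(weights, inputs):
        acc = field(acc + field(w * x))
    return acc == field(claimed_output)

# Quantized (integer) weights and inputs, as a circuit would require.
weights, bias = [3, -2, 5], 7
inputs = [10, 4, 6]
output = sum(w * x for w, x in zip(weights, inputs)) + bias  # honest prover

print(neuron_constraint(weights, inputs, bias, output))      # True
print(neuron_constraint(weights, inputs, bias, output + 1))  # False: bad claim
```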
This field is still in its infancy. Accuracy and fidelity can suffer in the process of converting a model into a circuit. When a model is represented as an arithmetic circuit, the model parameters, constraints, and matrix multiplication operations mentioned above may need to be approximated and simplified. And some precision may be lost when arithmetic operations are encoded as elements of the proof's finite field (or the cost of generating a proof without those optimizations would be overwhelming with today's zero-knowledge frameworks). Moreover, the parameters and activations of machine learning models are typically encoded with 32 bits of precision, but today's zero-knowledge proofs cannot represent 32-bit floating-point operations in the required arithmetic circuit format without enormous overhead. Developers may therefore opt to use quantized machine learning models, in which 32-bit values have already been reduced to 8-bit precision. These models are more favorable to represent as zero-knowledge proofs, but the model being verified may be a rough approximation of the higher-quality original.
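For intuition, the following minimal sketch uses a simple symmetric quantization scheme (chosen for illustration, not the scheme any particular ZK framework uses) to map 32-bit floating-point weights to 8-bit integers and back, showing the precision lost in the round trip.

```python
# Minimal symmetric int8 quantization sketch, for intuition only; real ZKML
# pipelines use their own quantization schemes and calibration procedures.
import numpy as np

weights_fp32 = np.array([0.8123, -0.0457, 0.3310, -0.9042], dtype=np.float32)

# One scale factor maps the float range onto the signed 8-bit range [-127, 127].
scale = np.abs(weights_fp32).max() / 127.0
weights_int8 = np.clip(np.round(weights_fp32 / scale), -127, 127).astype(np.int8)

# Dequantize to see how much precision survived the round trip.
weights_restored = weights_int8.astype(np.float32) * scale
max_error = np.abs(weights_fp32 - weights_restored).max()

print("int8 weights:", weights_int8)
print("max round-trip error:", max_error)
```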
At this stage, it is still a game of catch-up: as zero-knowledge proofs become more optimized, machine learning models keep growing in complexity. But there are already several promising areas of optimization. Proof recursion can reduce overall proof size by allowing one proof to be used as the input to the next, enabling proof compression. There are also emerging frameworks, such as Linear A's fork of the Apache Tensor Virtual Machine (TVM), which adds a converter that turns floating-point numbers into zero-knowledge-friendly integer representations. And finally, we at a16z crypto are optimistic about future work that will make representing 32-bit integers in SNARKs far more reasonable.
Two Definitions of "Scale"
Zero-knowledge proofs scale through compression: SNARKs let you represent an extremely complex system (a virtual machine, a machine learning model) mathematically so that the cost of verifying it is less than the cost of running it. Machine learning, by contrast, scales through expansion: today's models get better with more data, more parameters, and more GPUs/TPUs involved in training and inference. Centralized companies can run servers at nearly unlimited scale, charging a monthly fee for API calls and covering the costs of operation.
The economic realities of blockchain networks operate almost in reverse: developers are incentivized to optimize their code so that it is computationally feasible (and cheap) to run on-chain. This asymmetry has an enormous benefit: it has created an environment in which proof systems need to become more efficient. We should also be pushing for ways to demand in machine learning the same benefits that blockchains provide, namely verifiable ownership and a shared sense of truth.
While blockchains incentivize the optimization of zk-SNARKs, every field of computing stands to benefit.