Rethink and understand "data is oil": How does privacy computing protect data sovereignty?
Winkrypto
2021-07-05 08:33
At this moment, the value and security of data are increasingly worth rethinking.

This article, first published in 2019, covers the basics needed to understand the business model of privacy computing. Original title: "Data is more valuable than oil, but how can that value be realized?" Author: Li Hua.

The Economist published a cover article as early as 2017 arguing that data would replace oil as the most valuable resource of our era. Yet to this day, the ordinary people who hold sovereignty over this "data oil" still cannot benefit from it.

On the contrary, this data also exposes its owners to serious privacy leaks.

Why is there such a gap between the vision and the reality? How can data ownership and data value be realized? This article approaches the question from existing practice, hoping to clarify some threads and contribute to a framework for thinking about the issue.

We cannot sell data

I believe all of us have received sales calls. Most people's personal data has already been bought and sold, starting with the simplest items such as phone numbers and some consumption records, and it may be waiting somewhere to be sold again at this very moment.

Data does sell for money, and the money goes to the institutions that have access to our data.

This phenomenon easily leads to a misunderstanding: that we can realize the value of data by selling it. In other words, once legal provisions and technical means give us data sovereignty, we could sell our data to whoever needs it, obtain its value, and sell "oil" for money.

But this is wrong: we cannot buy or sell data. Before elaborating, we need to distinguish between data ownership and data usage rights.

For the vast majority of assets in the world, buying and selling means the transfer of asset ownership: one party gains ownership, and the other party loses ownership. But buying and selling data will not transfer the ownership of the data. You sold the data, but the ownership of the data still belongs to you.

Therefore, transactions around data are actually transactions in data usage rights, not data ownership. But because data can be copied infinitely, once we sell the data there is no guarantee of how the buyer will use it or whether it will be sold again. More precisely, we have to some extent "lost" the data, even though we still own it.

Illegal data transactions buy and sell the data directly because they do not care about the rights and interests of data owners. But once we truly hold ownership of our data, buying and selling the data itself is not how we can realize its value.

So how do we trade the right to use data without losing the data? The answer is not to trade the data itself, but only the results computed from it. That is, the buyer can have computations run on the data and obtain the desired results, but cannot obtain the original data.

This is the first, and perhaps most important, thing to understand when we discuss data ownership and data value: we cannot realize data value by selling data, only by selling the results computed from it.

In other words, we need to separate the ownership of data from the right to use it, and trade only the right to use it.
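As a rough illustration of "trade the results, not the data", the sketch below shows a data holder that answers only a few aggregate queries and never hands out raw records. The class name `DataCustodian`, its methods, and the minimum group size are hypothetical choices made for this example. It is plain access control rather than privacy computing proper, and is only meant to make the separation of ownership and use concrete.

```python
from statistics import mean


class DataCustodian:
    """Hypothetical data holder: keeps raw records, sells only aggregates."""

    def __init__(self, records, min_group_size=3):
        self._records = list(records)          # raw data stays inside this object
        self._min_group_size = min_group_size  # assumed threshold for this sketch
        self._queries = {"count": len, "sum": sum, "mean": mean}

    def run(self, query, field):
        """Return the result of an approved aggregate query over one field."""
        if query not in self._queries:
            raise PermissionError(f"query {query!r} is not offered")
        values = [record[field] for record in self._records]
        if len(values) < self._min_group_size:
            raise PermissionError("too few records to answer safely")
        return self._queries[query](values)


custodian = DataCustodian([{"age": 30}, {"age": 40}, {"age": 50}])
average_age = custodian.run("mean", "age")   # the buyer receives only this number
```

The buyer pays for `average_age`; a request for anything outside the approved list, such as a full dump, is refused.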

Privacy computing is not just for user privacy issues

How can we sell only the results computed from data? The answer is privacy computing.

Privacy computing means computing on data without exposing the original data, in a way that lets the results be verified. It covers several research directions, such as fully homomorphic encryption and secure multi-party computation. Many technical articles explain how these work, for readers who want to go deeper.

Here a second misunderstanding needs to be cleared up: privacy computing not only protects user privacy, it is also the basis for trading data usage rights, and therefore the basis for realizing data value.

This clarification is needed because "privacy computing" is easily understood as just another privacy protection technology, with the emphasis on "privacy", when in fact the emphasis is on "computing".
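To make "computing without exposing the original data" concrete, here is a minimal sketch of additive secret sharing, one of the simplest building blocks of secure multi-party computation. Each party splits its private number into random shares, and only sums of shares are ever published. The modulus, the function names, and the example values are assumptions made for illustration. A real protocol also needs secure channels and protection against dishonest parties.

```python
import secrets

PRIME = 2**61 - 1  # modulus for share arithmetic (illustrative choice)


def split_into_shares(value, n_parties):
    """Split value into n random-looking shares that sum to value mod PRIME."""
    shares = [secrets.randbelow(PRIME) for _ in range(n_parties - 1)]
    last = (value - sum(shares)) % PRIME
    return shares + [last]


def secure_sum(private_values):
    """Compute the sum of all inputs while no party sees another's input."""
    n = len(private_values)
    # Party i splits its value and sends share j to party j.
    all_shares = [split_into_shares(v, n) for v in private_values]
    # Party j adds up the shares it received; a single share reveals nothing.
    partial_sums = [sum(all_shares[i][j] for i in range(n)) % PRIME
                    for j in range(n)]
    # Only the partial sums are published; together they give the total.
    return sum(partial_sums) % PRIME


total = secure_sum([120, 80, 45])  # three parties, three private values
```

The result is correct as long as the true sum is smaller than `PRIME`. What can be sold here is `total`, not any of the three inputs.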

In the blockchain industry, since privacy computing is often used in cryptocurrency transactions and on the blockchain as a method to enhance user privacy, it is easier for people to understand privacy computing as serving the realization of user privacy. This understanding is not wrong, but it limits privacy computing to a small field.

Perhaps it is clearer from another angle. We can split the data issue into a user privacy problem and a data value problem. The user privacy problem is about ensuring that the original data related to a user is not disclosed and the user's privacy is not exposed. We can regard this as data privacy protection within a specific range.

At this stage, privacy computing is one optional approach to privacy protection.

Once data privacy is secured, if the user or enterprise chooses to leave the data where it is and do nothing, the story ends there. But if the user or enterprise wants to go further and obtain the value of the data, the data must be taken out and used, and things enter the next stage. Now various methods are needed to ensure that the data is not leaked at any point in its life cycle of use. We can regard this as full-range data privacy protection.

At this stage, privacy computing is no longer optional but necessary, because the way to realize the value of data is to sell the results computed from it, putting the data to use without exposing the original. Only privacy computing can achieve this.

If data is compared to oil, then privacy computing is the first step of refining. It is the basis on which we can turn "crude oil" into various products while protecting user privacy.

Not all data has similar value

Not all data is equally valuable, and not all data can be turned into value. This is another point we need to be clear about when discussing data value.

Only when we understand the complexity and diversity of data can we apply different legal terms and technical methods to different situations and really solve the problem.

This article makes a simple division of data into categories from the application point of view, and then introduces the value of each category. The classification proposed here is not necessarily comprehensive or accurate; it only serves as a basic framework for discussion.

We can divide the data into three categories:

  • The first category is identity data;

  • The second category is behavioral data;

  • The third category is productivity value data.

The first category, identity data, is used for registration and identity verification on the Internet and in the real world: ID numbers, phone numbers, account information, and so on. This type of information is the most valuable to the illegal data industry, and once leaked it poses a serious security risk to users. For the legitimate data industry, however, it has no computational value, because no meaningful results can be computed from it.

Therefore, this type of data itself does not need to consider how to realize data value through privacy computing.

The second category is behavioral data, which includes users' browsing traces on the Internet, consumption data, and product usage habits. This data can be processed to build personal profiles of users, which are then used to push advertisements and content, provide services, and even peddle opinions to them.

Behavioral data has two kinds of value. One is advertising value: we all know that advertising supports almost the entire Internet industry. The other is that it helps products understand users and provide them with better personalized services.

The data ownership issues now widely discussed around the world mainly concern this type of data. For a long time the various rights over it were unclear and people paid little attention. We did not realize the seriousness of the problem until the results computed from this data were used more and more to influence or control us.

The landmark event is the Facebook data scandal of 2018. A data analytics company called Cambridge Analytica obtained the data of more than 50 million Facebook users. By analyzing the data, it picked out people whose political positions were wavering and targeted them with precisely matched political ads, in an attempt to influence the U.S. election and the U.K. Brexit referendum.

The good news is that we appear to be taking back ownership of this type of data. The General Data Protection Regulation (GDPR) promulgated by the European Union stipulates that the individual who generates the data is the data subject, and he has the right to request the erasure of his personal data, as well as the right to object and request to stop the processing of his personal data.

The bad news is that we have not taken back the right to use the data. As mentioned earlier, the value of data is realized by trading the right to use it, so we are still far from turning this type of data into value that belongs to users. The difficulty is twofold.

On the one hand, even though it has been called the most stringent data protection regulation in history, GDPR only requires companies to tell users what data is being used and for what purpose before using it. In other words, it restricts companies from abusing data, but does not restrict them from using it.

On the other hand, because this type of data helps products understand users, it is hard to refuse companies that use the data on the grounds of improving user experience, which is what they are doing now. It seems difficult for users to sacrifice user experience by demanding that companies have no right to use any behavioral data, and even more difficult to ask companies to separate the two uses of the data on their own initiative and hand over part of the advertising value.

Does this mean that businesses can keep handling data the way they used to? Not really. The separation of data ownership and usage rights described above exists only on paper. Although companies hold only the right to use the data, they actually obtain and use the original data itself, so the problems of abuse and security remain.

And because public privacy awareness has awakened and countries have enacted data protection laws that place security responsibilities on the companies using the data, companies may face user backlash and huge fines once problems arise. This is why companies such as Google and Apple are doing a great deal of research in privacy computing today.

Taking Google as an example, its "Federated Learning" places the machine learning model on each device, where it is trained on local data. When the resulting user parameters are aggregated and sent to the cloud, privacy computing is achieved through privacy-preserving aggregation algorithms and systems engineering.
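The idea of training on each device and aggregating only parameters can be sketched in a few lines. The example below is a simplification under stated assumptions and is not Google's implementation: the model is a single weight `w` in `y = w * x`, all function names are hypothetical, and the pairwise masks come from one shared random generator purely for demonstration. In a real secure aggregation protocol each pair of devices would agree on its mask privately, so the server never learns it.

```python
import random


def local_update(w, data, lr=0.1):
    """One gradient step on the device's own (x, y) pairs; data never leaves."""
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    return w - lr * grad


def masked_updates(updates, seed=0):
    """Add pairwise masks that cancel in the sum and hide each single update."""
    rng = random.Random(seed)  # stand-in for masks agreed between device pairs
    masked = list(updates)
    for i in range(len(updates)):
        for j in range(i + 1, len(updates)):
            mask = rng.uniform(-100, 100)
            masked[i] += mask   # device i adds the mask
            masked[j] -= mask   # device j subtracts the same mask
    return masked


def federated_round(global_w, device_datasets):
    """Server averages masked updates; the masks cancel, leaving the plain mean."""
    updates = [local_update(global_w, d) for d in device_datasets]
    masked = masked_updates(updates)
    return sum(masked) / len(masked)


datasets = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0)], [(1.0, 2.1)]]
new_w = federated_round(0.0, datasets)
```

The server sees only masked numbers, yet their average matches the average of the true updates up to floating-point rounding.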

But it must be pointed out again that when enterprises separate data ownership from usage rights through privacy computing, the aim is not to let users trade data usage rights. They hope to reduce the risk of using data, avoid accusations of privacy leaks, and meet compliance requirements, so that they can continue to use user data for free.

Therefore, users still have a long way to go before they obtain the value of this type of data. The biggest difficulty lies in awareness. Only when we have a strong awareness of data ownership and usage rights can we push governments to introduce stricter data protection regulations, or promote a new Internet architecture that overturns today's centralized server model.

"Productivity value data" is the most valuable

After understanding "identity data" and "behavior data", we will introduce the third type of data, which we call "productivity value data" in this article.

One major use of this type of data is machine learning and AI training; another is data analysis in support of scientific research, product design, decision making, and so on. Used properly, this type of data can drive society in a more efficient and friendly direction. It is a kind of productivity.

The third type of data has the widest collection range and the largest amount of data. It can come from humans, such as personal medical data and financial data, personal product usage habit data, etc.; it can also come from IoT devices, such as atmospheric condition data collected by sensors, autonomous driving data, and so on.

Some of its sources are the same as for the second type of data, namely users of Internet products, but the processing methods and purposes differ: the second type is obtained from a user and used on that same user, while the third type is aggregated and used across data subjects. From the perspective of the data itself, a given piece of data can belong to both the second type and the third type.

The third type of data has the greatest value, and it may also be the first to enter a market for data usage rights and have its value realized.

With the second type of data, Internet companies hold the right to use the data and use it themselves, so no data transaction is needed. With productivity value data, by contrast, there are parties that do not hold the right to use the data but want to use it. From this perspective, the third type of data is the set of all data that can be turned into capital.

Medical data is a good example of how the third type of data can be used. If research institutions or pharmaceutical companies had large amounts of medical data to draw on, they could study diseases and develop new drugs better and faster. However, the medical institutions that hold the data will not share it with other institutions, because of user privacy concerns and their own interests.

If we separate the ownership and usage rights of data through privacy computing, we can establish a market for data usage rights, on which the data of different medical institutions, research institutions, and pharmaceutical companies can be connected. This is what is popularly called breaking down data silos. These institutions can then trade data usage rights, or pool data for joint disease research.

If we want to train AI capable of diagnosing diseases, we likewise need to break down data silos in this way, so as to give the AI more numerous and more comprehensive data.

What needs to be repeated is that at this stage, even if the transaction and value of data are realized, because the legal and usage boundaries of data usage rights are not clear, it is still difficult for us as individuals to get back all the value of data.

Data ownership and access is one of the most important issues of our time. According to the historian Yuval Noah Harari, author of "Sapiens: A Brief History of Humankind", if we want to avoid the concentration of wealth and power in the hands of a small elite, the key is to regulate the ownership of data.

Because data itself is complex and diverse, it may be faster and more effective to define and solve problems at small points with clear boundaries and accurate descriptions than to hope that public opinion, legislation, and technology will solve the problem as a whole. We can analyze individual data categories more specifically, or use different classification standards, and then discuss data privacy, data ownership, and the realization of data value on that basis.

Re-understanding "data is oil"

Data is often compared to oil.

Although cuneiform records show humans collecting natural oil along the coast of the Dead Sea, the history of the modern petroleum industry really began only after Abraham Gesner invented a method of extracting kerosene from coal in 1846, and Ignacy Łukasiewicz and Jan Zeh distilled kerosene from crude oil in 1853.

But this was just the beginning. As fuel for kerosene lamps, petroleum was nothing special. Only when it was later used in internal combustion engines did it show its enormous potential and become the most important resource in the world.

The similarity between data and oil is that data alone is not enough. Only by realizing the "refining technique" of data can it be possible to open the era of data industry.

The difference between data and oil is that oil had refineries first and demand from internal combustion engines only later, whereas data already has huge demand for its use but no mature technology and infrastructure to support that demand.

References:

1. "Federated Learning: Collaborative Machine Learning without Centralized Training Data"
2. "Helping organizations do more without collecting more data"
