Measuring Privacy Leakage in Term of Shannon Entropy

Differential privacy is a privacy scheme in which a database is modified such that each user's personal data are protected without significantly affecting the characteristics of the whole data. An example of such a mechanism is the Randomized Aggregatable Privacy-Preserving Ordinal Response (RAPPOR). However, the interpretations of the privacy, accuracy and utility parameters in differential privacy are not totally clear. Therefore, in this article an alternative definition of the privacy aspect is proposed, in which it is measured in terms of Shannon entropy. Here Shannon entropy can be interpreted as the number of binary questions an aggregator needs to ask in order to learn information from a modified database. The privacy leakage of a differentially private mechanism is then defined as the mutual information between the original distribution of an attribute in a database and its modified version. Furthermore, some simulations using the MATLAB software for special cases in RAPPOR are presented to show that this alternative definition does make sense.


Introduction
In the digital era, when many things can be done online, privacy becomes a more serious issue, especially if our personal data have to be submitted online for some reason. Even with published privacy policies (something that most users never read properly), there is some room for privacy violations. Here we will not talk about hackers or other outsiders, because those who violate privacy might come from the authorized parties.
The most annoying case is when some parties use their authority to leak someone's private data but no law or rule classifies it as a privacy violation, so they cannot be punished. Take, for example, our medical records stored in a hospital's database. Our data, together with other persons' data, might be used by other parties who want to learn something from the database, say a pharmaceutical company or a medical research center. We never know whether they really access the database only to gain the necessary information, or whether they also search for our personal data.
The most basic way to prevent this is to hide the names of the data owners, i.e. to make the data anonymous. Unfortunately, this may not be enough to protect our private data. Other parties can still access other attributes, such as height, weight, age, gender, etc. Consider persons with a very rare attribute value, for example someone who is unusually tall, short, heavy or thin. By looking at one or two specific attributes, other parties can uniquely identify such persons and, as a consequence, leak their private information. They, of course, violate those persons' privacy, but we cannot say that they break any law or rule in the privacy policies. Suppose that someone is famous as the tallest person in his/her city. Roughly speaking, as long as they do not ask the hospital who the tallest person in the database is, and the hospital does not tell them either, no law or rule is broken.
Motivated by issues of this kind, many data security researchers have tried to create new privacy protocols to protect private information. One of them is called differential privacy. The idea is to modify the original database such that each user's personal data are protected but the characteristics of the whole database do not change significantly. Other parties are therefore still able to learn information about the whole database, but they are unable to learn any personal information.
As a very simple example, consider five persons: A, B, C, D and E. In fact A and B are smokers, while the others are not. After modification, the smokers become C and E. The fact that A and B are smokers is hidden, but the fact that two of the five persons are smokers is preserved. Note that other parties know that the database has been modified, so they cannot conclude that C and E are smokers. Therefore, if they just want to know the proportion of smokers in the database, they will not get it wrong, but they will not know who the real smokers are.
In practice, of course, we work on much larger databases with various attributes. We do not have to preserve the exact proportion for every attribute, but we need to keep it within a small margin of error. The concept of differential privacy is discussed in the next section, together with some specific mechanisms which can be used.

Differential Privacy and RAPPOR
The idea of differential privacy first appeared in Dwork et al. [1] in 2006. That work introduced the idea of protecting privacy by adding noise to the data. At that time it had not yet been named differential privacy; the name came later, after some subsequent research. After a few years of thorough work in this area, a more comprehensive treatment of differential privacy was published in Dwork and Roth [2]. The concepts and definitions in this section are based on [1] and [2].

Differential Privacy
We now turn to the definition of differential privacy. Let a database be represented as a table in which the rows represent the users and the columns represent the attributes.
Sometimes the parties who have authorized access to the database only need a sample of the users and not all of them. We do not always know what they are looking for, but we can assume that they have full authority to do so. We say that two sub-databases are neighboring if one is obtained from the other by adding or deleting one row. If the database is not modified, it is possible to learn about one specific user by studying two databases: one that contains him/her and the one obtained by removing him/her from the former.

International Journal of Applied Sciences and Smart Technologies
Therefore, in order to protect that user's privacy, the modification mechanism needs to eliminate this possibility. This leads to the definition of a differentially private mechanism.

Definition 2.1. A mechanism A is said to be (ε,δ)-differentially private if, for every pair of neighboring databases x and x' and every set S of possible outputs,
$$\Pr[A(x) \in S] \le e^{\varepsilon} \Pr[A(x') \in S] + \delta. \qquad (1)$$
Equation (1) can be interpreted as saying that the outputs from two neighboring databases differ only by a very small and insignificant amount, so that (almost) nothing can be learned about the user in whom they differ. The smaller the numbers ε and δ, the more insignificant the differences and the stronger the privacy. In some specific cases the parameter δ in (1) is set to 0, and the mechanism is then said to be ε-differentially private.
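To make the role of ε more concrete, the following is a minimal Python sketch; it is not part of the original simulations. It assumes a mechanism that randomizes each user's record independently and is described by a table of conditional probabilities P(output | record). For such a mechanism with δ = 0, one common reading of the condition Pr[A(x) ∈ S] ≤ e^ε Pr[A(x') ∈ S] is that, for every output, the ratio of its probabilities under any two records is at most e^ε. The function name smallest_epsilon and the example table are hypothetical.

```python
import math

def smallest_epsilon(channel):
    """Smallest eps such that P(z | c) <= exp(eps) * P(z | c') for all c, c', z.

    channel[c][z] is the probability of reporting z when the true record is c.
    Returns math.inf when some output is possible for one record but not another.
    """
    eps = 0.0
    for z in range(len(channel[0])):
        column = [row[z] for row in channel]
        if min(column) == 0.0:
            if max(column) > 0.0:
                return math.inf      # the output z reveals the record with certainty
            continue                 # the output z never occurs at all
        eps = max(eps, math.log(max(column) / min(column)))
    return eps

# Hypothetical two-category table: the true value is reported with probability 0.75.
table = [[0.75, 0.25],
         [0.25, 0.75]]
print(round(smallest_epsilon(table), 4))  # 1.0986, i.e. ln 3
```

Under this assumption, a table whose rows are all identical gives ε = 0 (nothing can be learned), while an unmodified database gives an infinite ε.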
We now discuss the accuracy of a differentially private mechanism. In this context we are concerned with information from a database which can be used to answer predicate counting queries. The class of those queries is called the concept class, usually denoted by C. The set of all possible values of a database is called the data universe, usually denoted by X. The output of a predicate counting query c on a database x, denoted by c(x), is the proportion of elements of x which satisfy that predicate, for example the proportion of smokers or the proportion of patients with heart problems in a medical record database.
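As an illustration of a predicate counting query, the short Python sketch below computes c(x) as a proportion for the five-person smoker example from the Introduction. The function name counting_query and the row layout are assumptions made only for this example.

```python
def counting_query(database, predicate):
    """Return c(x): the fraction of rows in the database that satisfy the predicate."""
    if not database:
        raise ValueError("database must contain at least one row")
    return sum(1 for row in database if predicate(row)) / len(database)

# Original database: A and B are smokers.
original = [{"name": n, "smoker": n in ("A", "B")} for n in "ABCDE"]
# Modified database: the smokers have become C and E.
modified = [{"name": n, "smoker": n in ("C", "E")} for n in "ABCDE"]

def is_smoker(row):
    return row["smoker"]

print(counting_query(original, is_smoker))  # 0.4
print(counting_query(modified, is_smoker))  # 0.4
```

Both databases give the same answer to the query although the individual rows differ. In a larger database the two answers would usually differ slightly, and the size of that difference is what the accuracy of a mechanism is concerned with.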
We then have the following definition of accuracy.

Definition 2.2. For any c ∈ C, a mechanism A on database x is said to be α-accurate
The above definition can be interpreted as follows: even though each personal record has been modified, the proportion of users who satisfy a predicate does not change significantly. We need α to be smaller for better accuracy. Since it is very difficult to create a mechanism that is accurate for all queries in a concept class, the parameter γ is introduced. If γ is smaller, then more queries can be answered accurately. Furthermore, a utility parameter of a mechanism can also be defined based on its accuracy parameter.

Definition 2.3. Let C be a concept class and X be a data universe. A modification mechanism A is said to have (α,β,γ)-utility with respect to C and X if for a database x it
There are several kinds of mechanisms which can be used to modify a database while satisfying the principles of differential privacy. In this article we focus on the Randomized Aggregatable Privacy-Preserving Ordinal Response (RAPPOR) mechanism, which is discussed in the next sub-section.

Randomized Aggregatable Privacy-Preserving Ordinal Response (RAPPOR)
Let a database consist of several attributes, each of which can be divided into several categories. For example, we can categorize people according to their gender (male/female), age range and many more. For each attribute, each user is represented by the category he/she belongs to. To represent the category a user belongs to, we can also use a binary vector with exactly one 1 and 0 elsewhere, where the position of the 1 denotes the category he/she belongs to.
These binary vector representations are then modified randomly according to a probability distribution and sent to other parties, who thus receive an already modified database. To learn about the distribution of categories for each attribute, they have to take the aggregate values of each category. Because of this, we will from now on call them data aggregators. A data aggregator does not know the actual distribution of categories, but he/she may know the probability distribution that is used to modify the database. However, this knowledge should not be enough to leak actual information about the entire database.

There are several kinds of RAPPOR mechanisms, as presented in Wang et al. [3]. In this article we discuss two of them:

RAPPOR with direct representation
This kind of data aggregation mechanism works as follows. Let there be m categories in an attribute and let user i belong to category $C_i$. After modification, user i belongs to category $Z_i$. The probability that user i stays in his/her actual category ($Z_i = C_i$) is γ, and for each category $j \neq C_i$ the probability that user i belongs to category j after modification is $(1-\gamma)/(m-1)$.
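The mechanism can be sketched in a few lines of Python. This is only an illustration, assuming that categories are numbered 0, ..., m − 1 and that the remaining probability 1 − γ is spread evenly over the other m − 1 categories; the function name direct_rappor is hypothetical.

```python
import random

def direct_rappor(category, m, gamma, rng=random):
    """Report the true category with probability gamma, otherwise one of the
    other m - 1 categories chosen uniformly at random."""
    if m < 2 or not 0 <= category < m:
        raise ValueError("need m >= 2 and 0 <= category < m")
    if rng.random() < gamma:
        return category
    other = rng.randrange(m - 1)          # index among the m - 1 other categories
    return other if other < category else other + 1

# Example: modify the categories of six users with m = 3 and gamma = 0.8.
rng = random.Random(2019)                 # fixed seed, only for reproducibility
actual = [0, 0, 1, 2, 2, 2]
reported = [direct_rappor(c, 3, 0.8, rng) for c in actual]
print(reported)
```

The aggregator only sees the list reported and, possibly, the value of γ.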

RAPPOR with unary representation
In this mechanism, the category of user i is represented as a binary vector $C_i = (c_{i1}, c_{i2}, \ldots, c_{im})$, where $c_{ij} = 1$ if user i belongs to category j and $c_{ij} = 0$ otherwise. This binary vector is then modified by adding noise independently to each bit, so that a bit 0 can be flipped to 1 or vice versa. For each bit, the probability of a flip from 0 to 1 is $\beta_0$ and the probability of a flip from 1 to 0 is $\beta_1$. If $\beta_0 = \beta_1$, the scheme is called symmetric. The modified vector is denoted by $Z_i$ and is sent to the aggregator. Note that after modification the vector may contain more than one 1 or no 1 at all.
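A minimal Python sketch of the unary representation follows. The names one_hot, unary_rappor and aggregate are hypothetical, and the two flip probabilities are passed as beta0 (flip from 0 to 1) and beta1 (flip from 1 to 0); categories are numbered from 0 for convenience.

```python
import random

def one_hot(category, m):
    """Binary m-vector with a single 1 in the position of the user's category."""
    return [1 if j == category else 0 for j in range(m)]

def unary_rappor(bits, beta0, beta1, rng=random):
    """Flip each bit independently: 0 -> 1 with probability beta0,
    1 -> 0 with probability beta1."""
    modified = []
    for b in bits:
        flip_probability = beta1 if b == 1 else beta0
        modified.append(1 - b if rng.random() < flip_probability else b)
    return modified

def aggregate(reports):
    """Column sums of the modified vectors, i.e. what the aggregator counts."""
    return [sum(column) for column in zip(*reports)]

# Example: four users, m = 3 categories, symmetric scheme with beta0 = beta1 = 0.1.
rng = random.Random(7)                    # fixed seed, only for reproducibility
reports = [unary_rappor(one_hot(c, 3), 0.1, 0.1, rng) for c in (0, 1, 1, 2)]
print(aggregate(reports))
```

Because each bit is treated separately, a modified vector may contain several 1s or none, exactly as noted above.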
In the next sections we will not discuss the privacy and accuracy aspects using Definition 2.1 and Definition 2.2. Instead we will use a different approach based on concepts from information theory and see how it could work.

Re-defining Privacy Leakage in Term of Shannon Entropy
The concepts of differential privacy explained in the previous section leave some open problems. For example, in a differentially private scheme we want to determine values of ε and δ such that the privacy can be considered good enough, and values of α, β and γ such that the scheme has good accuracy and/or utility. We are also interested in the practical interpretation of those parameters in a specific mechanism, and in how changes in one or two parameters affect the others.
It is difficult to answer those questions since we do not have well-defined measurements for some parameters in differential privacy. Thus we might need another way of measuring the strength of privacy and accuracy. Wang et al. [4] introduced an idea that links differential privacy and mutual-information privacy. It should therefore be possible to study differential privacy using an information-theoretic approach. In this section we use a similar idea to re-define some aspects of differential privacy in the language of information theory.

Shannon Entropy and Mutual Information
Intuitively, stronger privacy implies worse accuracy and vice versa. As a consequence, we cannot have both aspects at their highest level, and we should try to find a solution that "optimizes" both privacy and accuracy. Their measurements therefore have to be "sensibly comparable". In this section an alternative definition of the privacy aspect of differential privacy, based on the information-theoretic point of view, is introduced. Some basic definitions from information theory, based on Cover and Thomas [5], are revisited first.
The binary entropy function h of an event with probability p is defined as
$$h(p) = -p \log_2 p - (1-p) \log_2 (1-p).$$

Definition 3.1. Let X be a random variable which takes the value x with probability p(x). The Shannon entropy of X is
$$H(X) = -\sum_{x} p(x) \log_2 p(x).$$
In some books, Shannon entropy is simply called "entropy". There are many interpretations of Shannon entropy. One of them is the number of binary (yes/no) questions which need to be asked in order to learn an outcome when the probability distribution is known. This interpretation might not be totally accurate, but it is sensible enough to define the privacy aspect. If the aggregator needs to ask too many questions in order to learn about an individual record, then we can say that the privacy is strong enough.
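The binary-question interpretation can be checked numerically with a small Python sketch of the standard formulas H(X) = −Σ p(x) log2 p(x) and h(p) = H applied to (p, 1 − p). The function names are hypothetical, and terms with probability 0 are taken to contribute 0, as is conventional.

```python
import math

def shannon_entropy(probabilities):
    """Shannon entropy in bits of a finite probability distribution."""
    if abs(sum(probabilities) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

def binary_entropy(p):
    """Entropy in bits of an event that occurs with probability p."""
    return shannon_entropy([p, 1.0 - p])

# Four equally likely categories: two yes/no questions identify the category.
print(shannon_entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0
# A fair coin needs one question, a certain event needs none.
print(binary_entropy(0.5))                         # 1.0
```

A distribution concentrated on few categories has lower entropy, which matches the intuition that fewer questions are then needed on average.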

After being modified, a database might still give some partial information about its actual data. By studying an already modified database, an aggregator might be able to recover some actual information without knowing the original database. This "leakage" can be represented as the mutual information between an original database and its modified version. The following is the definition of mutual information.

Definition 3.2. Let X and Y be two random variables. The mutual information between X and Y, denoted by I(X;Y), can be computed using these equivalent formulas:
$$I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X) = H(X) + H(Y) - H(X,Y).$$
Based on Shannon entropy and mutual information as in Definitions 3.1 and 3.2, we can create a new definition of the privacy aspect of differential privacy. This is discussed in the next sub-section.
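As a small numerical illustration, the Python sketch below computes mutual information from a joint probability table. It uses the form I(X;Y) = Σ p(x,y) log2 [p(x,y) / (p(x) p(y))], which is equal to H(X) + H(Y) − H(X,Y); the function name and the example tables are hypothetical.

```python
import math

def mutual_information(joint):
    """I(X;Y) in bits, where joint[x][y] = P(X = x, Y = y)."""
    p_x = [sum(row) for row in joint]               # marginal distribution of X
    p_y = [sum(column) for column in zip(*joint)]   # marginal distribution of Y
    total = 0.0
    for x, row in enumerate(joint):
        for y, p_xy in enumerate(row):
            if p_xy > 0:
                total += p_xy * math.log2(p_xy / (p_x[x] * p_y[y]))
    return total

# Y is an exact copy of a fair bit X: everything about X is revealed.
print(mutual_information([[0.5, 0.0], [0.0, 0.5]]))      # 1.0
# Y is independent of X: nothing about X is revealed.
print(mutual_information([[0.25, 0.25], [0.25, 0.25]]))  # 0.0
```

These two extreme tables correspond to an unmodified database and to a completely random modification, respectively.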

Alternative Definitions of Privacy Leakage
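If we go to our implementation, we directly get an idea for defining privacy leakage: the privacy leakage of a differentially private mechanism is the mutual information I(C;Z) between the original distribution C of an attribute in a database and its modified version Z.
In the next section we run some simulations using the MATLAB software to check whether our definitions of privacy leakage and utility really make sense.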

Simulation using the MATLAB Software
The alternative definition of privacy leakage introduced in the previous section appears to make sense, but it still needs to be checked with simulations of practical cases. Here we simulate privacy leakage first. Since computations on a large database would take a long time, we start with some special cases on small databases whose computations do not take much time to complete.

Case I : RAPPOR with direct representation
Recall the mechanism of RAPPOR with direct representation introduced in Section 2. In this mechanism, a user who belongs to a category has probability γ of staying in his/her actual category and probability $(1-\gamma)/(m-1)$ of moving into each of the other categories, where m denotes the number of categories. Therefore, if we know the actual distribution C, we can compute the entropy of the conditional probability distribution as
$$H(Z|C) = -\gamma \log_2 \gamma - (1-\gamma) \log_2 \frac{1-\gamma}{m-1}.$$
If a user is originally in category j (with probability $p_j$), then his/her probability of staying in category j is
$$P(Z = j \mid C = j) = \gamma.$$
If he/she is originally in another category i (with probability $p_i$), then his/her probability of moving to category j is
$$P(Z = j \mid C = i) = \frac{1-\gamma}{m-1}.$$
Taking the sum over these disjoint cases, we get a formula for $P(Z = j)$, namely
$$P(Z = j) = \gamma p_j + \frac{1-\gamma}{m-1} \sum_{i \neq j} p_i = \gamma p_j + \frac{(1-\gamma)(1-p_j)}{m-1}.$$
With $H(Z) = -\sum_{j} P(Z = j) \log_2 P(Z = j)$, the mutual information is
$$I(C;Z) = H(Z) - H(Z|C). \qquad (9)$$
Now we compute formula (9) as a function of the variable γ. We consider several cases with different numbers of categories.
Looking at the behavior of the graphs, when γ is closer to 0.5 the mutual information is closer to 0 and we therefore get stronger privacy. We also see that if there are more categories, the value of the mutual information is also bigger. However, we have not been able to compare multiple cases with different numbers of categories. Note that formula (9) depends on the value of m, and if m is bigger, the mutual information tends to be bigger too. This leads to a possible kind of "normalization", which makes the value of the mutual information fall in the interval $[0, 1]$. If $\gamma = 1$, the "normalized" mutual information should be equal to 1, which means that the aggregator is fully able to learn any information in the database since he/she receives the original one. Unfortunately, we are yet to find a formulation of the normalization factor.
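The computation for this case can be sketched in Python as follows. This is not the MATLAB code used for the simulations; it assumes that a user stays in the actual category with probability γ, moves to each other category with probability (1 − γ)/(m − 1), and that the leakage is computed as I(C;Z) = H(Z) − H(Z|C). The uniform original distribution used in the example is also an assumption, and the function names are hypothetical.

```python
import math

def entropy(probabilities):
    """Shannon entropy in bits; terms with probability 0 contribute 0."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

def direct_rappor_leakage(prior, gamma):
    """I(C;Z) = H(Z) - H(Z|C) in bits for RAPPOR with direct representation.

    prior[j] is the probability that a user is originally in category j.
    """
    m = len(prior)
    other = (1.0 - gamma) / (m - 1)
    # P(Z = j) = gamma * p_j + (1 - p_j) * (1 - gamma) / (m - 1)
    output = [gamma * p + (1.0 - p) * other for p in prior]
    # For every original category the conditional distribution of Z is a
    # permutation of (gamma, other, ..., other), so H(Z|C) does not depend on the prior.
    conditional = entropy([gamma] + [other] * (m - 1))
    return entropy(output) - conditional

# Leakage for a uniform original distribution and several numbers of categories.
for m in (2, 3, 4):
    uniform = [1.0 / m] * m
    values = [direct_rappor_leakage(uniform, g) for g in (0.5, 0.75, 1.0)]
    print(m, [round(v, 3) for v in values])
```

Under the uniform assumption the value at γ = 1 equals log2 m, which is one way to see why more categories give larger values, and the leakage is exactly 0 at γ = 1/m, which coincides with γ = 0.5 only when m = 2.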

Case II : RAPPOR with unary representation
Now we move on to the case of RAPPOR with unary representation. To simplify, we consider the symmetric case in which the two flip probabilities are equal, $\beta_0 = \beta_1$ (to avoid many subscripts, both are written as β from now on). Each user can belong to only one category, and therefore his/her binary vector representation C contains exactly one 1, in the position of his/her category, and 0 elsewhere. Given an arbitrary binary m-vector z, we compute the probability that the binary vector representation is modified into z.
This depends on how many 1s are contained in z, i.e. the Hamming weight of z.
We see behavior similar to the previous case. For any number of categories, the graphs are monotonically decreasing on the interval $[0, 0.5]$. The difference is that all graphs tend to 0 when $\beta \to 0.5$. We can interpret this as the aggregator being unable to learn anything when $\beta = 0.5$, i.e. when the binary flip is totally random. More categories also imply a bigger value of the mutual information, but the values are also yet to be normalized. Also note that if the range of β is extended to $[0.5, 1]$, the graphs are monotonically increasing there. If $\beta = 1$, all binary vectors are completely flipped (0 to 1 and vice versa) and the aggregator can easily determine the original ones. We can also intuitively conclude that the cases with flip probabilities β and 1 − β are practically similar.
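For a small number of categories the leakage in this case can be computed exactly by enumerating all 2^m binary vectors z. The Python sketch below is an illustration under stated assumptions rather than the original MATLAB code: each bit is flipped independently with probability β, so an original vector with its single 1 in position j becomes z with probability β^d (1 − β)^(m − d), where d is the number of positions in which z differs from the original vector. If w is the Hamming weight of z, then d = w − 1 when z has a 1 in position j and d = w + 1 otherwise. The function name and the uniform original distribution in the example are hypothetical.

```python
import math
from itertools import product

def unary_rappor_leakage(prior, beta):
    """I(C;Z) = H(Z) - H(Z|C) in bits for symmetric RAPPOR with unary representation.

    prior[j] is the probability that a user is originally in category j.
    The cost grows like m * 2**m, so this is only practical for small m.
    """
    m = len(prior)
    # H(Z|C) = m * h(beta), because the m bits are flipped independently.
    h_beta = -sum(p * math.log2(p) for p in (beta, 1.0 - beta) if p > 0)
    h_z = 0.0
    for z in product((0, 1), repeat=m):
        w = sum(z)                                   # Hamming weight of z
        p_z = 0.0
        for j, p_j in enumerate(prior):
            d = w - 1 if z[j] == 1 else w + 1        # number of flipped bits
            p_z += p_j * beta ** d * (1.0 - beta) ** (m - d)
        if p_z > 0:
            h_z -= p_z * math.log2(p_z)
    return h_z - m * h_beta

uniform = [1.0 / 3] * 3
for beta in (0.0, 0.1, 0.3, 0.5):
    print(beta, round(unary_rappor_leakage(uniform, beta), 3))
```

The enumeration over 2^m vectors also shows why the computation becomes expensive as the number of categories grows.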
Apart from those two cases, we have tried to carry out the computation for other mechanisms, but some of them have very complicated formulas that are very difficult to compute. Some computations even need several days to complete. Computation for a sufficiently large number of categories is also yet to be done. There are two possible solutions: simplifying the computation, or determining an upper/lower bound of the privacy leakage which is easier to compute.
p-ISSN 2655-8564, e-ISSN 2685-9432

Conclusions
From what has been discussed in this article, we have several conclusions and suggestions for possible further research:
1. By interpreting entropy as the number of binary questions which need to be asked in order to learn information from a database, it is possible to re-define the privacy and accuracy-utility aspects of a differential privacy scheme in terms of entropy. In this article the former has been done.