Database Anonymization. David Sánchez


Synthesis Lectures on Information Security, Privacy, and Trust


protection mechanisms that are more concerned with the utility of the data and offer only vague (i.e., ex post) privacy guarantees, whereas PPDP seeks to attain an ex ante privacy guarantee (by adhering to a privacy model), but offers no utility guarantees.

      In this book we provide an exhaustive overview of the fundamentals of privacy in data releases, including the privacy models, anonymization/SDC methods, and utility and risk metrics that have been proposed so far in the literature. Moreover, as a more advanced topic, we discuss in detail the connections between several proposed privacy models (how the guarantees offered by different privacy models can be accumulated to achieve more robust protection, and when such guarantees are equivalent or complementary). We also propose bridges between SDC methods and privacy models (i.e., how specific SDC methods can be used to satisfy specific privacy models and thereby offer ex ante privacy guarantees).

      The book is organized as follows.

      • Chapter 2 details the basic notions of privacy in data releases: types of data releases, privacy threats and metrics, and families of SDC methods.

      • Chapter 3 offers a comprehensive overview of SDC methods, classified into perturbative and non-perturbative ones.

      • Chapter 4 describes how disclosure risk can be empirically quantified via record linkage.

      • Chapter 5 discusses the well-known k-anonymity privacy model, which is focused on preventing re-identification of individuals, and details which data protection mechanisms can be used to enforce it.

      • Chapter 6 describes two extensions of k-anonymity (l-diversity and t-closeness) focused on offering protection against attribute disclosure.

      • Chapter 7 presents in detail how t-closeness can be attained on top of k-anonymity by relying on data microaggregation (i.e., a specific SDC method based on data clustering).

      • Chapter 8 describes the differential privacy model, which mainly focuses on providing sanitized answers with robust privacy guarantees to specific queries. We also explain SDC techniques that can be used to attain differential privacy. We also discuss in detail the relationship between differential privacy and k-anonymity-based models (t-closeness, specifically).

      • Chapters 9 and 10 present two state-of-the-art approaches to offer utility-preserving differentially private data releases by relying on the notion of k-anonymous data releases and on multivariate and univariate microaggregation, respectively.

      • Chapter 11 summarizes general conclusions and introduces some topics for future research. More specific conclusions are given at the end of each chapter.

      CHAPTER 2

       Privacy in Data Releases

      References to privacy were already present in the writings of Greek philosophers, who distinguished the outer (public) sphere from the inner (private) one. Nowadays privacy is considered a fundamental right of individuals [34, 101]. Despite this long history, the formal description of the “right to privacy” is quite recent: it was coined by Warren and Brandeis back in 1890, in an article [103] published in the Harvard Law Review. These authors presented laws as dynamic systems for the protection of individuals, whose evolution is triggered by social, political, and economic changes. In particular, the conception of the right to privacy was triggered by the technical advances and new business models of their time. Quoting Warren and Brandeis:

      Instantaneous photographs and newspaper enterprise have invaded the sacred precincts of private and domestic life; and numerous mechanical devices threaten to make good the prediction that what is whispered in the closet shall be proclaimed from the house-tops.

      Warren and Brandeis argued that the “right to privacy” already existed in many areas of the common law; they merely gathered these scattered legal concepts and brought them into focus under a common denominator. Within the legal framework of the time, the “right to privacy” was part of the right to life, one of the three fundamental individual rights recognized by the U.S. Constitution.

      Privacy concerns revived with the invention of computers [31] and information exchange networks, which skyrocketed information collection, storage, and processing capabilities and made population surveys commonplace. The focus then shifted to data protection.

      Nowadays, privacy is widely considered a fundamental right, and it is supported by international treaties and many constitutional laws. For example, the Universal Declaration of Human Rights (1948) devotes its Article 12 to privacy. In fact, privacy has gained worldwide recognition and it applies to a wide range of situations such as: avoiding external meddling at home, limiting the use of surveillance technologies, controlling processing and dissemination of personal data, etc.

      As far as the protection of individuals’ data is concerned, privacy legislation is based on several principles [69, 101]: collection limitation, purpose specification, use limitation, data quality, security safeguards, openness, individual participation, and accountability. However, with the appearance of big data, it is unclear whether any of these principles remains really effective [93].

      Among all the aspects that relate to data privacy, we are especially interested in data dissemination. Dissemination is, for instance, the primary task of National Statistical Institutes, which aim to offer an accurate picture of society; to that end, they collect and publish statistical data on a wide range of aspects, such as the economy and the population. Legislation usually equates privacy violations in data dissemination with individual identifiability [1, 2]; for instance, Title 13, Chapter 1.1 of the U.S. Code states that “no individual should be re-identifiable in the released data.”

      For a more comprehensive review of the history of privacy, see [43]. A more visual perspective is given by the timelines in [3, 4]: [3] lists key privacy-related events between 1600 (when it was a civic duty to keep an eye on your neighbors) and 2008 (after the U.S. Patriot Act and the inception of Facebook), while [4] depicts key moments that have shaped privacy-related laws.

      The type of data being released determines the potential threats to privacy as well as the most suitable protection methods. Statistical databases come in three main formats.

      • Microdata. The term “microdata” refers to records, each containing information related to a specific individual (a citizen or a company). A microdata release thus publishes raw data, that is, a set of microdata records.

      • Tabular data. Cross-tabulated aggregate values for groups of individuals are released. The term “contingency (or frequency) table” is used when counts are released, and the term “magnitude table” is used for other aggregate magnitudes. This type of data is the classical output of official statistics.

      • Queryable databases, that is, interactive databases to which the user can submit statistical queries (sums, averages, etc.).
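The differences between the three formats can be made concrete with a minimal sketch. The data set, attribute names, and values below are entirely fabricated for illustration; the `query_average` helper is a hypothetical stand-in for a queryable-database interface:

```python
# Toy illustration of the three release formats, built on a small
# fabricated microdata set (all names and values are hypothetical).
from collections import Counter, defaultdict

# Microdata: one raw record per individual.
microdata = [
    {"sex": "F", "region": "North", "income": 28000},
    {"sex": "M", "region": "North", "income": 31000},
    {"sex": "F", "region": "South", "income": 25000},
    {"sex": "M", "region": "South", "income": 40000},
    {"sex": "F", "region": "South", "income": 36000},
]

# Tabular data: a contingency (frequency) table counts individuals
# per (sex, region) group ...
contingency = Counter((r["sex"], r["region"]) for r in microdata)

# ... while a magnitude table aggregates a numerical attribute
# (here, the mean income per region).
incomes = defaultdict(list)
for r in microdata:
    incomes[r["region"]].append(r["income"])
magnitude = {region: sum(v) / len(v) for region, v in incomes.items()}

# Queryable database: users submit statistical queries (sums,
# averages, etc.) and receive only aggregate answers, never the
# underlying records.
def query_average(records, attribute, **conditions):
    subset = [r for r in records
              if all(r[k] == v for k, v in conditions.items())]
    return sum(r[attribute] for r in subset) / len(subset)

avg_south = query_average(microdata, "income", region="South")
```

Note that the tabular and queryable views expose only aggregates, whereas the microdata release hands out the individual-level records themselves, which is why microdata is the most flexible format for analysts and also the most privacy-sensitive.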

      Our focus in subsequent chapters is on microdata releases. Microdata offer the greatest flexibility among all types of data releases: data users are not confined to a specific, predefined view of the data, but can carry out any kind of custom analysis on the released records. However, microdata releases are also the most challenging for the privacy of individuals.

      A microdata set can be represented as a table (matrix) where each row refers to a different individual and each column
