The Question of the Questions
By David Sheidlower
Incessant questioning can reduce the best thinking to no more than a background chorus of "Are we there yet?" But there are still some things that have to be asked.
I have spent the past four articles observing how aggregation is emerging as more than just an automated process. I’ve tried to show the following:
- Aggregation is a process that is independent of the data it aggregates
- The individual’s relationship to data about them is changing
- Aggregation is the inverse of broadcasting
- The bias of aggregation is towards anonymity
- The actions of the aggregator can accelerate regression towards the mean
- An individual can be in a cohort without realizing it and certainly without choosing to be
- The world’s increasing reliance on cohorts creates a different kind of identity for the individual
- Because of all of the above, aggregation is creating a new monopoly of knowledge
It seems important to list the questions that I can’t find definitive answers to. I am posing the questions below. Those involved in maintaining and expanding the data-centric world mostly behave as if they’re either certain of the answers, or at the very least, never seriously consider the questions.
Certainty can destroy useful discovery. And if I'm correct that we are undergoing a transition to a new data-centric, aggregation-driven monopoly of knowledge, then it seems like a very good time to ask questions about that.
Does de-identifying result in dehumanizing?
Protecting individual privacy is increasingly important as more and more data are collected. This is so universally understood that even the United States Supreme Court, known for its bias towards allowing law enforcement wide latitude, surprised many who follow it when it ruled that the aggregated data on a cell phone required a different level of privacy protection from almost everything else found in a suspect’s pocket. (Presumably this protection would extend to a flash drive, but as far as I know that has not been tested by the Supreme Court.)
Experts and regulators recognize de-identifying data as a strong way of protecting an individual’s privacy. When done systematically, stripping a record of the elements that tie it to the individual subject of the data (e.g., name, full address, full birthdate, social security number, etc.) can even be measured for its effectiveness in making the data anonymous.
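As a rough illustration of what "measured for its effectiveness" can mean, the sketch below strips direct identifiers from a few made-up records, coarsens birthdate and ZIP code, and then reports the size of the smallest group of records that still look alike. That group size is the idea behind k-anonymity, which is one common measure and is my choice here, not something the sources above prescribe. All field names and records are hypothetical.

```python
from collections import Counter

# Hypothetical field names; a real dataset would define its own.
DIRECT_IDENTIFIERS = {"name", "street_address", "ssn"}


def de_identify(record):
    """Drop direct identifiers and coarsen birthdate and ZIP code."""
    cleaned = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    # Keep only the birth year from an ISO date such as "1980-04-12".
    cleaned["birth_year"] = cleaned.pop("birthdate")[:4]
    # Keep only the first three digits of the ZIP code.
    cleaned["zip3"] = cleaned.pop("zip")[:3]
    return cleaned


def smallest_group_size(records, quasi_identifiers):
    """Return k: the size of the smallest group sharing the same quasi-identifier values."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values())


# Made-up records, for illustration only.
raw = [
    {"name": "A. Jones", "street_address": "1 Elm St", "ssn": "000-00-0001",
     "birthdate": "1980-04-12", "zip": "10001", "diagnosis": "asthma"},
    {"name": "B. Smith", "street_address": "2 Oak St", "ssn": "000-00-0002",
     "birthdate": "1980-09-30", "zip": "10003", "diagnosis": "diabetes"},
    {"name": "C. Lee", "street_address": "3 Ash St", "ssn": "000-00-0003",
     "birthdate": "1975-01-05", "zip": "10002", "diagnosis": "asthma"},
]

released = [de_identify(r) for r in raw]
k = smallest_group_size(released, ["birth_year", "zip3"])
print(k)
```

Here the smallest group contains a single record, so at least one person remains unique on birth year and ZIP prefix even after the obvious identifiers are gone. A larger k would indicate stronger anonymity under this particular measure.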
But if de-identifying data results in removing the ability of those that use the data to identify the subject of the data, it also limits the subjects from having any participation in its use. Perhaps something is lost in this process that should not be. I’ve already discussed how problematic the idea of “consent” is when the subject knows that their information is being collected and has an idea how it will be used. Given those difficulties, there are clearly obstacles that would need to be overcome for the subjects of data to meaningfully participate in the handling of their data once it is de-identified.
Subject participation is not the only area where de-identified data might be at risk of being dehumanizing. The analytic techniques that make large de-identified datasets useful include methods for dealing with outliers, “margins of error” and even those missing from the dataset (inference). Valid as the techniques themselves are, the question remains whether or not they create problems when an individual is treated merely as a member of a cohort.
It is not my intention here to evoke the image of some idealized past where individuals had some mythical power over their surroundings. Over time, the power of individuals has been as varied as the individuals themselves. Our relationship to the communities we are part of has always been one where we are members of them and, for better or worse, they are larger than us.
The question here is this: how does our identity in groups change when the groups are defined by distilling our identity down to a limited set of characteristics that are optimized for focused purposes, especially when those purposes may or may not be related to the overall interests of the group itself?
Is surveillance always welcome?
It’s a bit of a trick question. Most people I talk to will answer quickly “of course not; you can’t speak in absolutes. The word ‘always’ is too definitive.” And yet the apologists for surveillance do seem to take for granted that their mission implies a different answer.
Consider how James Clapper, U.S. Director of National Intelligence, described his mission. In September 2014, with tongue in cheek and a friendly audience (the AFCEA/INSA National Security and Intelligence Summit), Clapper put it this way:
We are expected to keep the nation safe and provide exquisite, high-fidelity, timely, accurate, anticipatory, and relevant intelligence; and do that in such a manner that there is no risk; and there is no embarrassment to anyone if what we’re doing is publicly revealed; and there is no threat to anyone’s revenue bottom line; and there isn’t even a scintilla of jeopardy to anyone’s civil liberties and privacy, whether U.S. persons or foreign persons. We call this new approach to intelligence: “immaculate collection.”
Even the Director of National Intelligence should be free to be facetious when it’s appropriate (and it was), but the implied point of his remarks is still clear: our mission is such that collecting data creates risks to things like privacy and reputation, and they are risks we must take.
The term “surveillance” is often used to refer to real-time monitoring and conjures up images of CCTV cameras everywhere. The word itself derives from words meaning “to watch.” Increasingly, as discussed above, it also refers to the wholesale collection of structured data, to aggregation, sometimes called “dataveillance.”
Some Public Health experts will claim that their brand of surveillance is always welcome. John Snow’s analysis of the 1854 Broad Street cholera outbreak in London is a legend among data scientists and public health analysts. The data were clear, the cause of the outbreak accurately identified, and the lifesaving remediation simple and effective. The history of surveillance for public health purposes has nightmares in it as well. The Tuskegee syphilis experiment was an inexcusable case of the inhumane gathering of data.
The major differences between the two examples above are obvious. The Londoners in 1854 were contracting cholera where they lived and the goal was to prevent additional cases of a deadly disease. The unfortunate victims of the Tuskegee experiment were already ill, removed from their homes and had the cure withheld from them.
But the similarities are also clear: they both involved aggregating data for the purposes of public health and the Tuskegee victims and the London victims of cholera all represented data points to the analysts studying them.
Then there’s public safety and security. Advocates of government surveillance in the name of public safety claim that the public is safer when there are guard[ian]s watching and so surveillance is always welcome. They sometimes claim that the risk of such dataveillance having adverse impacts on society can be mitigated by transparency.
For example, under the Data Mining Reporting Act of 2007, the government is required to report on data mining programs and recognizes that data mining is an activity…
…involving pattern-based queries, searches, or other analyses of 1 or more electronic databases, where—
(A) a department or agency of the Federal Government, or a non-Federal entity acting on behalf of the Federal Government, is conducting the queries, searches, or other analyses to discover or locate a predictive pattern or anomaly indicative of terrorist or criminal activity on the part of any individual or individuals; (Data Mining Reporting Act of 2007, Section 804(b)(1)(A))
Given the recent disclosures around the wholesale collection of phone records by the NSA, reports of instances of abuse of that data mining and the debate surrounding that, it would be difficult to say that, even for public safety, surveillance is always welcome.
In addition, there’s surveillance as a means of control over the subject of the surveillance. Beginning with Jeremy Bentham’s design for the Panopticon in the 1790s, omnipresent surveillance is often discussed as a means of social control. Critics of this acknowledge its welcome aspects, for example: enforcing no-smoking ordinances, ensuring appropriate behavior in public places, deterring theft and otherwise finding dangerous outliers, etc.
They also point out that the widespread use of this kind of surveillance can cause a “dangerisation” of one’s view of the world around them: “The mere visibility of the [anti-theft surveillance] system on the one hand sustains in users the constant awareness of the probability of a threat and, on the other hand, automatically transforms the usual consumer into a ‘non-thief’.” (Michalis Lianos. Social Control after Foucault. Surveillance & Society 1(3): 412-430).
To put it another way: it is one thing to behave as a “law abiding citizen,” it is another to be reminded by continuous surveillance that there are those who do not behave that way.
Just as with everything related to Big Data, the question of wholesale data-centric surveillance is also often discussed in terms of ensuring that an individual’s right to privacy is respected. Indeed, protecting individual privacy is increasingly important as more and more data are collected.
Of the questions I am raising here, this one is the most widely discussed and debated. Are there times when the collection of data itself needs to be challenged? Surveillance may be a necessary evil, but is it always evil? Is it always necessary?
I think the definitive answer to these questions is to never stop asking them.
Is resistance futile?
We must be careful that the question of whether or not the data are secure does not drown out the question of whether or not to aggregate the data in the first place.
Kim Crawley’s recent opinion piece in SC Magazine, The problem with Big Data, provides an example of this oversight. Crawley explains the security risks that we collectively take with Big Data. Crawley points out “Our Big Data technology wasn't initially developed with security in mind, but we must work hard to correct that. It requires fixing what we have now, and constantly monitoring and fixing systems as they grow.”
While Crawley discusses the essential security basics of encryption, access control, hardening the environment, monitoring it, pen testing it and making sure qualified security professionals are maintaining it, she never considers the idea that some risks are undertaken by capturing data that should not be stored.
“Thieves can't steal what you don't have. Data minimization is a powerful element of preparedness. The rules are disarmingly simple: Don't collect information that you don't need,” writes Brian Lapidus, COO of the Cyber Security & Information Assurance practice of the security firm Kroll, discussing how to prevent data breaches.
Even though eliminating the risk of a data breach by not storing the data is as accepted a risk mitigation strategy as, say, log monitoring, Crawley does not mention it.
Still, Crawley is not wrong to take for granted that repositories of Big Data exist and will continue to grow and that it is the job of the security professional to protect them. Big Data is not a fad. Our understanding of it will grow, but likewise, its uses will as well. And aggregation and its consequent impacts on how we think, what we think about and what we expect the result of that thinking to be—what I’ve been referring to in this series as a “monopoly of knowledge”—is not slowing down.
Advances in technology and refinement of the tools for using Big Data add to that momentum. In addition, we have to acknowledge the eagerness of the growing analytic community to aggregate and use Big Data. How eagerly is information collected? Consider New York State’s collection of Health Care utilization data:
Health care facilities must submit on a monthly basis to the SPARCS program, or cause to have submitted on a monthly basis to the SPARCS program, data for all inpatient discharges and outpatient visits. Health care facilities must submit, or cause to have submitted, at least 95 percent of data for all inpatient discharges and outpatient visits within sixty (60) days from the end of the month of a patient’s discharge or visit.
Health care facilities must submit, or cause to have submitted, 100 percent of data for all inpatient discharges and outpatient visits within one hundred eighty (180) days from the end of the month of a patient’s discharge or visit. (title 10, section 410.18.b.1. (iii) of the Official Compilation of Codes, Rules, and Regulations of the State of New York, emphasis added)
That data submission is mandatory makes it ironic that SPARCS stands for “Statewide Planning and Research Cooperative System”. In the context of New York State health care providers, “cooperative” means that they cooperate on the use of the data and cooperate via a governance committee to control disclosure of the datasets. Submission of the data, on the other hand, is not a matter of cooperation but of compliance with regulations.
Private sector aggregations of data rarely have regulations governing them as tightly as the New York State SPARCS data. Some large repositories, like credit bureaus in the United States, may be governed by regulations that limit their use. Others may be governed by laws that keep them from being created and/or transported across international borders.
In more and more countries, repositories that contain Personally Identifiable Information (PII) are governed by regulations aimed at protecting individual privacy and that is increasingly important as more and more data are collected.
Transactions that include creating an electronic record of each transaction are increasing. Many events that would not traditionally be recorded in an electronic record are being redesigned to do just that. Cash registers become Point of Service data collection terminals and medical devices generate a digital stream of data reporting on the patients they’re hooked up to.
The security professional needs to recognize that analysts need data to do their jobs and deliver their value to the organization that both the analyst and the security professional work for. The ROI demonstration of some automation upgrades in an organization may even be stated in terms of the value of the analyst’s output. (The Clinical Quality Measures defined for the U.S. Federal Government’s Meaningful Use program, which provides monetary incentives for health care providers to implement electronic medical records, are a very clear example of this.)
The organization’s need to generate value from the data it collects and the analyst’s need for the raw material of their work are powerful motivators for creating large datasets.
Debates around how to use large aggregated datasets, who should access them and how to govern and protect them are essential. Those debates should begin with the question of whether or not the dataset should be created in the first place.
Can means be defined so that a population regresses towards them?
Big Data is sometimes described in terms of “v-attributes:” volume, velocity, variety, value, and variability. Big Data is composed of a high volume of data. It grows at a high velocity because the number of records that make up a Big Data dataset increases as more and more electronic transactions are recorded and aggregated. Big Data also tends to include a variety of data points and even feeds from a variety of sources making its predictive power greater. All these attributes add to the value of Big Data.
Then there’s variability. This describes the fact that Big Data tends to be comprehensive and therefore representative of the wide variation of characteristics of a population. When Big Data is used to describe the population whose data it contains, variability is just another attribute. Statistical models do not have difficulty accounting for variability. “Variance,” in fact, is a technical term used in statistics to help describe the relationship of data to the mean, i.e., how spread out a population is relative to the average for that population.
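A small, hedged illustration of that technical term: the two made-up populations below share the same mean, but their variance (the average squared distance from the mean) differs widely. The numbers are invented for the example and use only Python's standard `statistics` module.

```python
from statistics import mean, pvariance

# Two invented populations with the same average but different spread.
narrow = [48, 49, 50, 51, 52]
wide = [10, 30, 50, 70, 90]

# pvariance treats the list as the whole population, not a sample.
print(mean(narrow), pvariance(narrow))
print(mean(wide), pvariance(wide))
```

Both populations average 50, yet the second is far more spread out. A statistical description handles either case equally well; the difference only starts to matter when someone tries to act on the population as if everyone sat at the average.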
In “When big data meets dataveillance,” Sara Degli Esposti refers to the v-attributes and then defines four steps in realizing the potential of Big Data analytics (the four definitions below are direct quotations from her article; emphasis is hers):
- Recorded observation refers to the act of paying close attention—by watching, listening or sensing—to someone or something in order to gather and store this information in electronic format.
- Identification alludes to the recognition of an object, or a person’s identity, through the analysis of an object, or a person’s unique features.
- Analytical intervention refers to the application of analytics to the transformation of the collected information into knowledge, usually as a result of the first two types of actions mentioned above.
- Behavioral manipulation indicates the ability of influencing people’s actions intentionally.
It is in this last step that high variability can be seen as a drawback.
While analytics has no problem at all with variability, operationalizing behavioral manipulation is more complicated the more variability comes into play. This is where the discipline of data science, the field of operations research and the newer applications of behavioral manipulation are biased against variability.
In the world of data analysis, this bias has its origins in another common use of data that originated in manufacturing: Six Sigma. Six Sigma, in fact, names a statistical target of extremely little variation from the mean (no more than 3.4 defects per million opportunities for a defect). “One of the guiding principles behind Six Sigma is that variation in a process creates waste and errors. Eliminating variation, then, will make that process more efficient, cost-effective and error-free.” (from the Villanova University definition of Six Sigma)
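For readers curious where the 3.4 figure comes from, here is a short worked calculation. It relies on the conventional Six Sigma assumption, not stated in the quotation above, that a process mean may drift by 1.5 standard deviations over the long term, so the nearest specification limit effectively sits 4.5 standard deviations away. The variable names are mine.

```python
from statistics import NormalDist

SIGMA_LEVEL = 6.0
ASSUMED_SHIFT = 1.5  # conventional long-term drift; an assumption, not a law

# Distance from the shifted mean to the nearest specification limit.
z = SIGMA_LEVEL - ASSUMED_SHIFT

# Probability that a normally distributed result falls beyond that limit.
defect_probability = 1.0 - NormalDist().cdf(z)

# Express the probability as defects per million opportunities.
dpmo = defect_probability * 1_000_000
print(round(dpmo, 1))
```

The result rounds to 3.4. Without the assumed shift, a true six standard deviation limit would correspond to a far smaller defect rate, which shows how much of the famous number rests on convention.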
A normally distributed population regresses towards the mean over time. Behavioral manipulation that is driven by the data on a population might create a feedback loop between the data and the actions it takes to influence the population’s behavior based on those data.
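The feedback loop imagined here can be sketched as a toy model. It assumes, purely for illustration, that each round of intervention moves every individual a fixed fraction of the way towards the current average. The function name, the `strength` parameter and the starting values are all hypothetical, and the sketch is not evidence that real interventions behave this way.

```python
from statistics import mean, pvariance


def nudge_towards_mean(population, strength):
    """Move every value a fraction `strength` of the way to the current mean."""
    centre = mean(population)
    return [x + strength * (centre - x) for x in population]


# Invented starting population; the mean is 50.
population = [10.0, 30.0, 50.0, 70.0, 90.0]
spread_by_round = [pvariance(population)]

# Measure, intervene, measure again: five rounds of the loop.
for _ in range(5):
    population = nudge_towards_mean(population, strength=0.2)
    spread_by_round.append(pvariance(population))

print(spread_by_round)
```

In this model the mean never moves, while the variance shrinks by the same factor every round. Whether anything like this happens in practice is exactly the open question.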
Does successful “behavioral manipulation” increasingly steer individuals towards a state where “one size fits all?” This question is not meant to imply a moral judgment on these activities or even imply that the practitioners of it are necessarily aware of this effect. But it is a question that should be asked if for no other reason than it might fall under the category of defining “unintended, perhaps unwelcome, consequences.”
A conclusion against certainty
Certainty can destroy useful discovery. The current momentum of Big Data collection, analytics and use, our acknowledgment that our actions can have unintended consequences and the sheer volume of data being collected all suggest that this is no time to be certain of things. I’m sure of that.
 Pre-dating the world of Big Data, the bias against variability has been promoted by economists, who for centuries have demonstrated that efficiency comes with specialization.