Big Data And Us Little People
By David Sheidlower
The last series I wrote for securitycurrent dealt with principles of data security and privacy. Many authorities charged with enforcing data protection accept those principles. They are based on the idea that the actors in data transactions (i.e., subjects, collectors, disclosers, users and regulators) all have a role to play in creating and maintaining the world of data.
The argument goes that if we all understand each other’s roles relative to any given data point then the world of data itself, the data-centric system, can be understood.
What if that’s not entirely true?
At some level, it is of course accurate. And it is tempting to believe it. Our very experience of data creation when we are the subjects of a data point reinforces that idea: that a data point is just that. Extending the metaphor of a “point” from Euclid’s Geometry, we think of a series of data points as forming a line, a trend.
A data-centric system might just be a collection of individual transactions that form a static snapshot. When the points are connected like so many dots, they form a trend line.
In this view, the system as a whole is the total collection of stored records, and we only need to deal with how individuals and organizations interact with those records. It is a role-based model for data access and control, and it has been mostly adequate for our purposes.
Privacy advocates have worked along these lines since data was understood to be sensitive. Regulators also consider this to be the proper paradigm. The collectors, disclosers and users of data have all worked to define and understand their roles in light of regulations and ethical considerations.
But two recent events call into question whether we can see data as just records.
The events are separate cases in different courts, one brought by Mario Costeja González of Spain and the other by David Leon Riley of the United States. The former won a case in the Court of Justice of the European Union and established a “right to be forgotten.” The latter won a case in the United States Supreme Court in which the justices acknowledged that the data on a cell phone was unique because the device itself functioned as an aggregator.
In both cases, the courts recognized that an aggregation of data is distinct from the transactions it aggregates, and that the aggregation itself can be controlled apart from those transactions.
We are used to thinking of data as a static snapshot, perhaps valid only for the moment the snapshot is taken, but still fixed for that moment. A data extract, an analysis, the results of a search: these are all products of computer processing and are thought to be objective and, to a large extent, free of any bias except the bias in the data itself.
This is not to say that analysis or search terms cannot also contain bias, but the general belief is that the procedures which process and store data do not introduce bias into the content.
Credit scoring, one of the earliest examples of “Big Data,” is governed by federal regulations, and those regulations explicitly reflect that belief. Assuming the data are collected appropriately, the credit scoring model must, by law, be "empirically derived, demonstrably and statistically sound" (Equal Credit Opportunity Act 1974 (Regulation B), Section 202.2(p)). Even when bias has been found in these models, it has been traced to bias in the inputs or in the model’s algorithms, not to the acts of aggregating, processing and storing the data.
Jacques Vallée, in The Network Revolution (1982), expanded on Marshall McLuhan’s “global village” from The Gutenberg Galaxy (1962). Vallée, writing before the Internet exploded, described the emergence of a great communication network that presented new possibilities for individuals to have access to one another. Vallée and McLuhan were concerned mostly with communication, not data; with access, not aggregation.
What if data aggregation and processing have the properties of media like radio, print or television? Communication theorists might argue that radio, print and television all assume a sender and a receiver, whereas information-processing models do not assume that the action of “input” intends to communicate through the action of “output.” In other words, data isn’t necessarily sent; it’s processed and stored. Data output? Sometimes.
Others have noticed the similarity and overlap between information-processing and communication models. My basic premise begins from the work of Harold Innis in the 1950s and Marshall McLuhan in the 1960s. The premise that the medium itself creates bias is, in other words, not a new idea. I’m going to try to apply that idea to data-centric systems and see where it leads.
This series intends to look at the following aspects of a data-centric system:
- Aggregation is a form of narrative (search engines create narratives)
- Narratives are not snapshots (analysis and search results are more than snapshots)
- Being part of a cohort is a new state of identity (regressing towards the mean might be a problem if you have no say in the terms used to create that mean)
- Search engines and data analysis are a medium as powerful as any that has come before them (successful media create monopolies of knowledge)
- All media have bias
This is not a call to action. It is not a manifesto declaring that there is a moral truth to one approach to data or another. It is an attempt to describe a transition that is in progress but has not yet fully taken hold. It is a recognition that, although the bias Innis and McLuhan saw as inherent in the phonetic alphabet has not been fully replaced by what McLuhan then called the “electronic age,” there are things worth observing now.
Why would a security professional care about this? The short answer is that we protect data and so the more we understand about it, the better we can protect it. But I can provide a more thorough answer than that.
Just as the network perimeter of the enterprises we protect is no longer a walled-in fortress, the data we protect is no longer just a single flat file or set of related tables sitting on a server in the center of that fortress.
You can only really protect an information asset you understand. You can only guard against attacks that you can imagine. In the one section of his book that zeroes in on databases, Vallée sums up why security professionals should care how data aggregation and processing are evolving: “Anything that can be designed by human logic can be fooled by human ingenuity.”
If you want to truly protect data (even your own), you are going to have to accept that sometimes it slips its moorings.