For Whom the Bell Curve Tolls
By David Sheidlower
People prefer to choose the groups they are in. Even before social media exploited that, there were fan clubs, fraternities, sororities, and many different kinds of groups that people associated themselves with.
There are also the groups that people don’t choose but through birth, prejudice, unforeseen circumstances and/or unwanted diagnoses, they find themselves in nonetheless. Those groups are generally more difficult to leave.
There is a different kind of group that can encompass any of these but does not have to. These groups overlay a different relationship between the group and the individual: the cohort.
With the exception of specific research studies, individuals do not voluntarily join a cohort the way they join other groups. They do not first consider the goals or intentions of those forming the cohort and then, having decided to support them, “sign up.”
For example, obtaining a car loan is for the purpose of purchasing a car, not for the purpose of having your loan payment history contribute to the credit-scoring model that will be used to underwrite future loan applications of people like you.
Even when individuals find themselves in a group involuntarily, say in the case of a diagnosis, their relationship to the cohort is more complex than just one of membership. Virtually nobody says “yes, I just got hepatitis and the first thing I am thinking about is how I will now be part of the statistics on people with hepatitis.”
Aggregation is biased towards anonymity. Its ability to form cohorts creates a monopoly of knowledge that is based on the predictive power of the cohorts that make up the population. When the population is formed correctly and the aggregation of data that brought it into being is sufficiently broad, then, with few exceptions, it can be represented as a normal distribution. Normal distributions, in which most of the population clusters around a central mean and thins out toward two extremes, are graphically represented as bell curves.
In the previous article, Anti-Viral, I discussed how aggregation is biased towards anonymity. This bias is so strong that the EU Court of Justice found that the appearance of a named reference to an individual in an aggregated dataset was not to be considered the processing of personal data so long as the search parameters that led to that aggregation did not name the individual.
In other words, your right to be forgotten only applies when the searcher is looking for things about you. The Court defended the searcher’s right to have your information contribute to any other cohort.
The phrase “monopoly of knowledge” is discussed at length by Harold Innis, a communication scholar and historian. Writing 60-70 years ago, Innis insisted that how information is processed in a society invariably influences the relationship between what is known, how that knowledge gets created and even who gets to know it.
He distinguished between an oral society based on local discourse, the establishment of an elite literate and learned class that used manuscripts, and the commercialized distribution of printed material. His most famous successor, Marshall McLuhan, extended that idea to the developments of the phonetic alphabet, radio and television. Every monopoly of knowledge that Innis and McLuhan defined had characteristics that were as unique as the technologies that enabled them.
Innis believed that advances in information processing technology eventually disrupted whatever the current monopoly of knowledge was:
“I have attempted to trace the implications of the media of communication for the character of knowledge and to suggest a monopoly or an oligopoly of knowledge is built up to the point that equilibrium is disturbed.” (Presidential Address to the Royal Society of Canada, 1947)
In this series, I am identifying a disturbance of the current equilibrium/monopoly. McLuhan predicted this in his discussion of the “electronic age.” Paul Levinson, following McLuhan, has also suggested that this disturbance or transition is occurring. Levinson is among those who believe the transition is being caused by generalized advances in communication technology that are centered on the Internet.
I am not trying to discount the communication possibilities that are opened up by the Internet, the cell phone and other “on-line” experiences. But I think the technological advances that are disrupting the current monopoly of knowledge are more specific. Those advances are the combination of three developments:
- Massive amounts of data being recorded by computer driven transactions
- Ever increasing data storage capacity and data processing power which can accommodate those records
- Advances in the understanding of the predictive power and use of aggregated data
Discrete data points concerning a single individual can be plotted out in a line. Even if those data points are just numbers (an individual’s height as they aged), the line they form tells a story (they got taller for 16 years and then their growth stopped).
The narrative is unique to the individual. Regardless of the trend, our evaluation of whether the next data point will conform to that line depends at least as much on our knowledge of the individual, and our belief in the power of that individual to change their future, as it does on our belief in trends.
The right to be forgotten is an acknowledgement that individuals should have some control over what data points go into the creation of those trend lines. But building a linear narrative about the individual is not the only way that the individual’s data points are used to build a narrative “about” them.
When an individual’s data points are aggregated into a population that is normally distributed and they are then identified as part of a cohort within that population (in our example above: average growth given a person’s nationality, age and gender), that line hardly matters. That individual narrative is replaced by the probability of a given narrative for that cohort.
To go from the world of human physiology back to the earlier example of underwriting an auto loan: while the individual’s characteristics are what place them in a cohort, it is the cohort’s chance of defaulting on an auto loan, not the individual’s, that is reflected in a credit score used to underwrite the auto loan application of that individual.
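The distinction between the cohort's rate and the individual's own history can be sketched in a few lines. This is only an illustration under invented assumptions: the cohort keys (an age band and an income band), the sample loans, and the function names are all hypothetical, and real credit-scoring models are far more elaborate.

```python
from collections import defaultdict

# Hypothetical past loans: (age_band, income_band, defaulted). All values invented.
past_loans = [
    ("25-34", "mid", False),
    ("25-34", "mid", True),
    ("25-34", "mid", False),
    ("25-34", "mid", False),
    ("45-54", "high", False),
    ("45-54", "high", False),
]

def cohort_default_rates(loans):
    """Return the observed default rate for each (age_band, income_band) cohort."""
    totals = defaultdict(int)
    defaults = defaultdict(int)
    for age_band, income_band, defaulted in loans:
        key = (age_band, income_band)
        totals[key] += 1
        defaults[key] += int(defaulted)
    return {key: defaults[key] / totals[key] for key in totals}

rates = cohort_default_rates(past_loans)

def score_applicant(age_band, income_band):
    """Look up the cohort's rate; the applicant's own payment history is never consulted."""
    return rates[(age_band, income_band)]

print(score_applicant("25-34", "mid"))
```

The applicant's characteristics are used only to select the cohort; the number that comes back describes everyone who shares them.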
Similarly, it is often the cohort’s most likely response to a course of treatment that the physician relies on in ordering or recommending a treatment. It should be noted that no physician would subscribe to that model, classed in the realm of what is called “evidence based medicine,” 100% of the time; they rightly reserve the right to consider other factors in their treatment recommendations.
This method of modeling behavior and outcomes is becoming increasingly important to how the world around us interacts with us. Recognizing that, Steven Salzberg, Computer Science professor at Johns Hopkins University, recommends replacing the study of Calculus in High School with the study of statistics and computer science.
The bell curve that represents a normal distribution has increased in importance in people’s lives. Yet the traditional elementary school curriculum spends more time on other shapes. School children are taught the difference between a scalene and an isosceles triangle in greater detail than the characteristics of a bell curve. I do not mean to denigrate geometry or calculus.
I do agree with Salzberg that the emphasis of current curricula should be re-examined. However, I am also bringing up this disconnect between what is relevant to people’s lives and what is on a traditional school curriculum to further illustrate the idea above that the “equilibrium is disturbed.”
With that, how we make sense of a person’s identity is also changing.
The linear narrative created by an individual’s actions is being replaced by the individual’s current identity. The narrative is then no longer about how the individual’s past might lead up to the individual’s next action but about how the individual’s present identification with a group, a cohort, more or less predicts their future.
When a thief steals an identity in order to fraudulently apply for credit or use that individual’s credit accounts, they are assuming the present identity of that individual. They are not attempting to impersonate someone’s past. Regardless of how good the victim’s reputation is, all the thief wants are the rights and privileges that come with the victim being in that cohort.
Some would argue that group identity has always played a part in how people were treated by others. Class-based societies judge individuals by the class they belong to, and there is a tremendous number of studies documenting all kinds of discrimination based on people identifying others as belonging to certain groups.
Many of those studies are cohort studies and they employ the same fundamentals of statistics described here. In fact, the statistical techniques used for modeling outcomes and behaviors were developed in large part by social scientists trying to measure significance in limited datasets. Those studies emphasize how the world impacts the individual, not how the individual impacts the world. In that respect, the current uses of Big Data are no different.
There are a few other important characteristics when looking at groups and the monopoly of knowledge created by aggregation of large datasets.
1. Not joining but being in anyway: Traditionally, you are aware of your membership in a group. You may have joined or you may have had membership “thrust upon you” but either way, you know you are a part of it. When data are aggregated for the purposes of modeling, the raw records that go into the model are often de-identified.
So there is usually no sense on the part of the collector or user of the data that the individual’s consent is required for this use of their data. You can usually “opt out” of being impacted by the results of the modeling, but you usually cannot opt out of having your de-identified experience contribute to the model.
The privacy policies that are maintained by data gathering institutions are clear that they apply to what they classify as personal information. Once it is de-identified, the information is not covered by those policies. For example, in the United States, health information is covered by the HIPAA Privacy Rule except when it is de-identified:
“It is important to note that there are circumstances in which health information maintained by a covered entity is not protected by the Privacy Rule. PHI [Protected Health Information] excludes health information that is de-identified according to specific standards. Health information that is de-identified can be used and disclosed by a covered entity, including a researcher who is a covered entity, without Authorization or any other permission specified in the Privacy Rule.” http://privacyruleandresearch.nih.gov/pr_08.asp (emphasis added)
2. Being an outlier (even by choice) can be the equivalent of falling in the study’s margin of error: Even if the individual were to choose to be an outlier as a form of protest or to somehow muddy the description of the cohort’s results, their behavior would be classified by the model as not significant.
The individual in Spain who won a case in the EU Court of Justice against Google for the right to be forgotten offers an example. He would find that the data point he wanted forgotten would appear in a dataset of “Spanish real estate transactions in the late 90’s” if the dataset were aggregated with that description. The Court’s ruling specifically allowed for the data point to be in datasets that were created without using his name in the search criteria. It is ironic that an analysis of that dataset would not be materially affected if his data point were, in fact, left out of it.
It’s Big Data and it has to be modified in big ways to impact analysis that comes from it. If, to take an absurd example, 20-year-olds were to try to convince a supermarket chain that a vitamin intended for 70-year-olds was their favorite nutritional supplement, they would have to arrange for large numbers of them to buy that supplement regularly over a significant period of time.
Even if you did not have your data captured in a given context, it would still get “counted.” Both traditional studies and advanced models today will use a technique called “inference” that can account for the behavior of those that belong to the cohort but whose data are not represented in the aggregated dataset.
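The earlier point, that one record barely moves a large aggregate, is easy to check with arithmetic. The sketch below uses a purely synthetic list of 100,000 values (the `prices` name and every number in it are invented for illustration) and measures how far the mean shifts when a single record is left out.

```python
# Synthetic stand-in for a large aggregated dataset (all values are invented).
prices = [100_000 + (i % 100) * 1_000 for i in range(100_000)]

def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)

full_mean = mean(prices)

# Drop one record, as if one individual's data point had been "forgotten".
reduced_mean = mean(prices[1:])

# Express the change as a fraction of the original mean.
relative_shift = abs(reduced_mean - full_mean) / full_mean
print(f"relative shift in the mean: {relative_shift:.8f}")
```

With a dataset this size, the shift is a few parts per million, which is why a lone outlier or a lone omission tends to disappear into the analysis.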
3. Regression towards the mean: Traditional studies took a snapshot of data and produced analysis with it. Longitudinal studies will track a number of those snapshots over time. Studies are difficult to reproduce and refine because gathering new data to re-do the study is costly.
Increasingly, the aggregation of data occurs in a cycle wherein datasets can be almost continuously refreshed from the on-line systems that create them. Models are then easily refined. A cohort’s behavior does not just regress towards the mean (the average for that cohort) as data are collected over time; it is actually assisted in doing so.
The actions taken towards those in that cohort will reflect the average because that is the most efficient approach (i.e., has the highest probability of success). And so the individuals in that cohort will be presented with offers, treatments and opportunities that reflect the average for them and/or in ways that the majority of them will respond to.
If the only offer an individual is ever presented with is the one that the average member of their cohort will choose, then this greatly increases the likelihood that the average response will be chosen and that in turn increases the power of the model to not just regress towards the mean but to help reinforce it.
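A toy model can illustrate this reinforcing loop. It is a sketch under strong assumptions, not a description of any real system: the three options, their starting shares, and the `nudge` parameter (the fraction of people who, shown only the cohort's top choice, end up taking it) are all invented.

```python
def reinforce(shares, rounds, nudge=0.2):
    """Simulate rounds in which only the cohort's most popular option is offered.

    Each round, a fraction `nudge` of every other option's share moves to the
    top option. Returns the top option's share after each round.
    """
    shares = dict(shares)  # copy so the caller's data is not modified
    history = []
    for _ in range(rounds):
        top = max(shares, key=shares.get)
        for option in shares:
            if option != top:
                moved = shares[option] * nudge
                shares[option] -= moved
                shares[top] += moved
        history.append(shares[top])
    return history

# Hypothetical starting preferences within one cohort.
start = {"A": 0.5, "B": 0.3, "C": 0.2}
history = reinforce(start, rounds=5)
print([round(share, 3) for share in history])
```

In this model the leading option's share rises every round, so each refreshed dataset makes the "average" response look even more dominant than the last.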
The monopoly of knowledge created by aggregation is one that groups people together and predicts their behavior based on that grouping. But it is not as passive as it sounds. The predictions are often focused on responses to offers, treatments or other actions. The power of Big Data, aggregated and analyzed, is to not only define how the average member of a group responds to the world, but to define what the world needs to do to elicit a response from the individual.
The greater the number of demographic identifiers in a de-personalized dataset, the more likely it is that someone could find an individual in it. Privacy advocates and security professionals rightly concern themselves with this aspect of protecting data.
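The effect of adding identifiers can be seen by counting how many records become one of a kind as more fields are combined. The five records and the field names below are invented for illustration, and `unique_count` is a hypothetical helper rather than a standard privacy metric.

```python
from collections import Counter

# Invented, de-personalized records: no names, only demographic fields.
records = [
    {"zip": "10001", "birth_year": 1980, "sex": "F"},
    {"zip": "10001", "birth_year": 1980, "sex": "M"},
    {"zip": "10001", "birth_year": 1975, "sex": "F"},
    {"zip": "10002", "birth_year": 1980, "sex": "F"},
    {"zip": "10002", "birth_year": 1980, "sex": "F"},
]

def unique_count(rows, fields):
    """Count rows whose combination of `fields` is shared with no other row."""
    keys = [tuple(row[field] for field in fields) for row in rows]
    counts = Counter(keys)
    return sum(1 for key in keys if counts[key] == 1)

for fields in (["zip"], ["zip", "birth_year"], ["zip", "birth_year", "sex"]):
    print(fields, unique_count(records, fields))
```

Each additional field splits the groups further, so more records stand alone and could, in principle, be matched to a single person.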
There is no question that the risk of suffering financial and reputational harm when your individual data is inappropriately accessed is very real and can be very high.
The HIPAA Privacy Rule referenced above goes so far as to define exactly what 18 characteristics of a health record must be removed to de-identify it.
Even assuming the individual’s privacy risk is all but eliminated in the use of large aggregated datasets, the individual is still impacted by them. That impact is created and measured at the group, or cohort, level.
We do not yet have a way of defining whether or not that creates risks and if so what kind. But aggregation does create narratives and those narratives do tend to “stand in” for certain details of an individual’s life and future.
To put it another way: whether or not you feel like the needle, you cannot help but be part of the haystack.
 This article focuses on the collection and aggregation of data and so the individuals discussed are those that are the subject of those data points.
 There are, of course, other distributions that might be used in a given analysis. The normal distribution is most fitting here because we are talking about common practices in probability applied to large populations.
 This “self-fulfilling prophecy” of statistical modelling refers to responses over which people have a choice, not to things over which they do not, such as medical treatments.