Statistical Imaginaries: An Ode to Responsible Data Science

This text was first presented as a talk at the 2021 Microsoft Research Summit. If you would prefer to watch the talk, you may do so here.

Three hundred thirty one million, four hundred forty nine thousand, two hundred and eighty one. According to the US Census Bureau, that is the total resident population of the United States as of April 1, 2020. That number is a form of data theater.

Statistics, in its original form, is the state’s science. State-istics. The original name, btw, was political arithmetic. Today, we distinguish the state’s statistics by calling them official. These are data in the oldest of senses. The Latin root of data comes from a notion of “the givens.” The state is producing statistics that are then given to the public as data. And those givens are then treated as facts.

When statistics are understood as facts, the public expects exact knowledge. Precision. A census is supposed to be an enumeration of the public, a count of all the people. The Census Bureau produces numbers as facts, knowing full well that such a number is only the best number producible given the procedures the bureau uses. But just because the process worked does not mean that the result is as exact as the number implies. The Census Bureau could tell the public that the total population is almost 331 and a half million people. Rounding would communicate uncertainty. They could also offer up a number with digits after the decimal point, conveying that there are models underlying the data. But they do not. They produce precision because precision signals authority. Because precision is a norm and expectation. Because there is pressure to be precise.

Now… You might not think that the census is interesting or relevant to what you do, but if you care about data of any form, the census matters. Censuses are the backbone of countless data-making practices. Any data that is nationally representative is pegged to a country’s census. If you’ve ever paid attention to a country’s GDP, employment rates, or housing, that country’s census data are inevitably inside. If you’ve tracked COVID-19 rates or vaccine information around the globe, once again, national census data is likely at the root of the data.

Statistical agencies are charged with producing official statistics for use in public policy decisions and research. In the United States, census data are used to apportion political representatives and distribute funding. In other words, census data are explicitly and Constitutionally democracy’s data infrastructure. But even when political representation is not directly tethered to census data, such data are highly political and deeply contested. This is true in many countries where knowing information about the population is as much about politics as it is about accounting. This is what drove the United Nations to create a statistical commission in 1947 to formalize international standards in official statistics, to help promote the professionalization of national statistics in order to help statistical agencies resist political interference.

But the professionalization of national statistics also created an important question: what are statistics when they are no longer political arithmetic? What do all who are invested in data imagine statistics to be?

In most technical communities, it’s easy to think of statistics as objective, scientific, mathematical work. The ideal of objective information exists because decision-makers find comfort shifting responsibility onto data. This allows people to avoid the politics of data and the politics of decision-making, to pretend as if the involvement of data makes things neutral. This is dangerous thinking. This is how data become weaponized.

But an objective framing also obscures the deeply political origins of many techniques we now take for granted. (Psst, data scientists: Next time you do a regression, remember that this technique was developed by the same man who invented eugenics; his interest in this technique was not benign.) For all of the dubious roots of many methods and practices in statistics, the development of mathematical statistics also opened up a deeper appreciation for how we can understand the limits and biases of analyses. For example, in the 1910s, a group of Black clerks at the US Census Bureau began calculating the undercount of Black people in earlier censuses. This opened up new possibilities for remedying operational procedures that produced differential undercounts. As statistical techniques became more advanced, scientists also began imagining how mathematical interventions could repair weaknesses in the data. Throughout the 20th century, as the field of statistics became professionalized, more and more government agencies began imagining how scientific advances could benefit their mission.

Recognizing the possibility of producing higher quality statistics, government agencies began deploying scientific advances to improve the quality of their data. But, in the US at least, this work was often met with political resistance. For example, the Census Bureau introduced sampling to reduce the burden on respondents asked to answer numerous detailed questions. By 1957, Congress barred the Census Bureau from using sampling in key data products. Recognizing the significance of missing people in the count, the Census Bureau began leveraging data from other sources and building models to fill in gaps in data using a technique known as imputation. This too was challenged in court when the state of Utah argued that the Census Bureau had no right to impute data, both because sampling was statutorily forbidden and because imputation would violate the Constitutional requirement of an “actual enumeration.” This was politically framed as “fake data” or “junk science.” The Supreme Court rejected these claims, arguing that imputation was not a statistical method, but a technique that improved the count. Yet, in making this claim, the high court also became an arbiter of statistical methods.

Statistical agencies are required to produce high quality statistical knowledge, but who decides what constitutes statistical knowledge? Those who are invested in modern statistics and advancing science presume that the goal of a statistical agency is to create mathematically valid statistical knowledge. That the output of a census should be the best proximate quantitative depiction of the country possible. But not everyone sees the concept of statistics in this light. For those who envision a census as a count of all of the people, data in a census that does not directly reflect a person is suspect. From this vantage point, the work of the Census Bureau is to focus on the act of counting and reporting what is counted. The goal is the best count, not the ideal data. This distinction, as we will see, can quickly become extraordinarily messy.

….

The moment that data matter, those data can never be neutral. The greater the stakes, the less objective those data can be. The very choices of what data to collect, how to categorize data, and how to present data reveal ideological, social, and political commitments. The United States started collecting data about race in 1790 in order to enact white supremacy, but the United States still collects data about race because laws passed during the 1960s Civil Rights Era attempted to reclaim census data and use this data as a cornerstone to anti-discrimination and voting rights laws.

When institutions collect data, there are politics to what data are collected, and there are also politics to what data are not collected. France outlawed its census from collecting data about race and ethnic origin in 1978; they outlawed the collection of data about religion back in 1872. This is viewed by proponents as crucial to enacting a race-blind secular society, but critics argue that failure to collect this data means that France is ill-equipped to grapple with inequities and racism. Of course, France still does a census, even if it tries not to see certain things. Lebanon, on the other hand, did its last census in 1932. Back then, they found that roughly half the population was Christian and half Muslim, split evenly between Sunni and Shia. Lebanese politicians have repeatedly rejected proposals to conduct another census, because knowing how the population has changed presents significant political risks.

Many countries have laws that depend on the existence of data. But that doesn’t guarantee that the data will exist. In the United States, there is a lot of verboten data, but one might think that if the Constitution requires the data, the US would collect the data. This is not true. After its Civil War, the US passed a constitutional amendment abolishing slavery. Three years later, the US amended the Constitution again to ensure that people who are born in the US are citizens and to eradicate the most egregiously racist clause of the Constitution by ending the practice of treating enslaved persons as three-fifths of a person. But there’s a third data-centric clause to the 14th Amendment that no one remembers. Should any state prevent a male over the age of 21 from voting, unless convicted of a crime, the federal government is supposed to subtract the equivalent number from the total population for that state, thus weakening their political apportionment. This Constitutional Amendment presumed that the government could and would collect the necessary data. Yet, since 1870, Congress has not permitted the Census Bureau to ask who has been denied the right to vote. This is an act of purposeful ignorance.

Statistics helps us know different aspects of our nations, but what we are able to ask about depends on a range of political, ideological, and economic commitments. We have seen this issue on full display recently as countries struggle to collect and produce data about the COVID-19 pandemic. But the politics surrounding data aren’t just about what data is or is not collected. Political forces also shape our ability to talk responsibly about the limits of data.

….

Censuses are never perfect. Censuses miss people. They miss people because they are elsewhere and they miss people because not everyone wants to be counted. And they miss people because not all people are deemed legitimate enough to be counted by the state. In other words, they miss people for operational reasons, for social reasons, and for political reasons.

When official statistics are understood as objective gifts of data given by the state, they are presumed to be able to speak for themselves, to tell their own story. But data don’t speak for themselves. They can’t. They speak on behalf of others. And what they say depends on the goals and interests of those trying to coax them into speaking. Many powerful people use data to justify their decisions. The objective gleam of data gives decision-makers cover. Yet, when policymakers and executives leverage data to justify their actions, they want the data to stay on message. For data to stay on message, data must communicate precisely and with confidence. Such data cannot reveal their own warts and limitations, raise questions or offer alternative interpretations. Data must not be seen as weak, for data that are viewed as weak threaten the legitimacy of statistical work.

Just as data must stay in line, so must data users. Researchers dissect and analyze undercounts after every census, but when the data are produced, those who want to build policies based on them want facts unsullied by uncertainty.

I had a conversation with an applied demographer about how she presented data to policymakers. She recalled a story of her early days working for a municipality where she attempted to communicate confidence intervals alongside the data only to be rebuffed. Her government client did not want to hear that there could be error in the data. If there were error, the data could not be trusted. She was told to come back with facts. Versions of this story were repeated to me by countless data analysts whose job was to coax data into speaking on behalf of decision-makers, both in government and in industry.

Anyone who has worked with data has — at one time or another — asked data to speak for themselves. “Just look at the data!” is a statement of exasperation that happens in both board rooms and at kitchen tables. But how practitioners try to communicate the limitations of data varies significantly. When required to support decision-makers who are innumerate, data users often feel most comfortable ignoring uncertainty and error, knowing that such info will only confuse or anger. Those deep in technical weeds cannot comprehend how anyone can ethically work with data and ignore such signals. Of course, there is also an art to presenting uncertainty knowing that the person receiving the data might either ignore the uncertainty or spin the uncertainty to undermine the data. Just consider the context of political polling where pollsters dutifully point out that the outcomes were within the margin of error when their predictions are wrong, even though they know full well that their data was presented to suggest a definitive outcome with the margins of error functioning as a tiny footnote. Or, on the flipside, when climate scientists try to responsibly communicate the uncertainty of complex models only to have their work systematically undermined as lacking certainty.

The Census Bureau is expected to produce facts and perform precision both to appear authoritative and because any responsible scientific communication involving uncertainty can be politicized. The government scientists know that data has limitations. Much of this information is communicated in scientific meetings and in footnotes and metrics, often buried in esoteric files published alongside the data. Government researchers also provide follow-up analyses after publishing data. But, by and large, those who rely on democracy’s data infrastructure tend to ignore signals of uncertainty, error, or noise when using the data. Some ignore it because they don’t know how to work with such information; others ignore it because their clients want to hear facts and precision. Still others see the very discussion of uncertainty as creating risk that the data will be de-legitimized.

Yet, in my mind, the illusion of perfect census data has become more costly than people realize. Without being able to grapple with the limitations of data, the Census Bureau is unable to get the social and political support to introduce new techniques that can systematically improve federal statistics. This is particularly costly in a social context in which getting people to self-respond or share information with government representatives is getting increasingly challenging. The scientific community has developed a range of techniques to improve data quality in spite of data collection limitations, but embracing these requires that stakeholders understand data’s limitations and vulnerabilities.

One reason why census data are imperfect is that the public does not always trust the government to be a data caretaker. Since 1840, those conducting the census in the US have known that privacy is key to getting people to participate. For the Census Bureau, statistical confidentiality is a key requirement for procedural, legal, and moral reasons. The procedural imperative has only grown since 1840, with numerous studies repeatedly showing that people will not respond if their data can be identified. For over a century, there have been legal requirements that prevent census data from being accessed for non-statistical purposes. More recently, as scholars uncovered how census data were used in the US and Europe during World War II, there has been an increased commitment within the statistical community towards data privacy more generally.

To achieve statistical confidentiality, the Census Bureau has been evolving its disclosure avoidance procedures for years. Initially, the government simply chose not to publish certain statistics but, by the 1980s, the Census Bureau was facing significant pressure to publish more detailed data. So, in the 1990 census, the Census Bureau began injecting noise into the published data in order to publish small area geographic data. The noise that was injected was not systematic; it consisted of edits designed to smooth outliers from visibility. This should’ve made statisticians cringe, but few knew that this process was happening and, to this day, no one outside of the Census Bureau knows much about the level of edits that took place.

Meanwhile, as computer scientists began showing that these edits provided little protection, these same computer scientists also began developing differential privacy as a possible intervention. Differential privacy is nothing more than a mathematical definition. It is a framework for being able to evaluate how much information about individual records leaks from statistical data. By introducing statistically unbiased noise into the data, a system based on differential privacy can try to maximize statistical confidentiality and high quality statistical output. The allure of differential privacy to those invested in statistical confidentiality is that such a technique is future-proof. No matter how much new information is introduced, a system that is managed through differential privacy can only leak as much information as it was designed to leak.
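For readers who want to see the definition itself, here is the standard textbook statement of pure differential privacy. The notation is an assumption added for illustration, not something drawn from the talk: M is a randomized release mechanism, D and D' are any two datasets that differ in one person's record, S is any set of possible outputs, and epsilon is the privacy-loss parameter. Other variants of the definition exist, and this is not a description of the Census Bureau's specific system.

```latex
% M is epsilon-differentially private if, for all neighboring datasets D and D'
% (differing in a single record) and for every set of outputs S:
\[
  \Pr\bigl[\,M(D) \in S\,\bigr] \;\le\; e^{\varepsilon}\,\Pr\bigl[\,M(D') \in S\,\bigr]
\]
```

Read informally: whether or not any one person's record is in the data, the probability of any published result changes by at most a factor of e to the power epsilon. A smaller epsilon means the two cases are harder to tell apart.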

There are numerous ways to implement differential privacy, but all involve Greek letters serving as variables that govern key aspects of the system. One of these letters — epsilon — represents the privacy-loss budget within a differential privacy system. Think of this as a knob. Turn the knob one way and the data are noisier but have higher privacy protections. Turn it the other way and the noise recedes, but the data become more vulnerable to attacks. Systems that rely on differential privacy are designed to enable decision-makers to manage the amount of noise introduced in order to strategically balance the privacy of individual records and the reliability of statistical calculations.
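To make the knob concrete, here is a minimal sketch of one textbook approach, the Laplace mechanism applied to a single count. It is an illustration under stated assumptions, not the Census Bureau's disclosure avoidance system, which is considerably more elaborate. The function names, the block population of 120, and the epsilon values are all hypothetical.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw one sample from a zero-mean Laplace distribution with the given scale."""
    # The difference of two independent exponential draws (each with mean
    # `scale`) follows a Laplace distribution with that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(true_count: int, epsilon: float, rng: random.Random) -> float:
    """Release a count with Laplace noise calibrated to epsilon (illustrative only)."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    # Assumption: adding or removing one person changes a count by at most 1.
    sensitivity = 1.0
    return true_count + laplace_noise(sensitivity / epsilon, rng)


def mean_abs_error(true_count: int, epsilon: float,
                   trials: int = 2000, seed: int = 0) -> float:
    """Average absolute error over many simulated releases of the same count."""
    rng = random.Random(seed)
    errors = [abs(noisy_count(true_count, epsilon, rng) - true_count)
              for _ in range(trials)]
    return sum(errors) / trials


if __name__ == "__main__":
    block_population = 120  # hypothetical small-area count
    for epsilon in (0.1, 1.0, 10.0):  # hypothetical knob settings
        err = mean_abs_error(block_population, epsilon)
        print(f"epsilon={epsilon:>4}: average error of about {err:.2f} people")
```

In this sketch the noise scale is 1/epsilon, so a small epsilon yields noisier counts with stronger protection and a large epsilon yields more accurate counts with weaker protection. Because the noise distribution is published rather than hidden, a data user can in principle account for it when computing uncertainty.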

Differential privacy takes four things for granted. First, it presumes that publishing usable statistics while protecting the confidentiality of underlying data is imperative. Second, it presumes that usable statistics can be understood in mathematical terms. Third, it presumes that data users find value in knowing, understanding, and measuring noise, error, and uncertainty. Fourth, differential privacy presumes that transparency is desirable.

The Census Bureau began integrating differential privacy into its scientific products in 2006, making previously inaccessible data available for the first time. The scientific community cheered. But the decennial census is different than other data products produced by the Census Bureau. And so when the bureau decided to modernize the statistical disclosure system used for its canonical product, they failed to appreciate how much backlash they would receive. The bureau relished the ability to be upfront about its procedures. Scientists imagined that this would allow for better governance of the statistical system, and better accounting of uncertainty. They thought users would be pleased. They were wrong.

Lawsuits began even before census data were released. More are expected. Once again, we can expect the Supreme Court to have to grapple with what statistics are. Some opponents to differential privacy have scientific concerns, but many of those who are challenging the bureau’s right to modernize its disclosure avoidance system do not see census data through the lens of mathematics. They want the data to be facts, to speak for themselves. And they see differential privacy as an abomination for daring to alter data in the first place. To complicate matters more, there are also people who see political opportunities in fighting the bureau, regardless of the ramifications to statistical work.

Transparency is an ideal that’s common in computer science, especially in areas birthed from cryptography, which holds a deep moral commitment to transparency. Likewise, mathematicians and computer scientists see uncertainty not as something to avoid, but something to actively embrace. Within this way of seeing the world, advances in scientific method to improve data quality and negotiate statistical confidentiality are a statistical boon. But they are also a political nightmare.

Epistemology is the study of knowledge. How we know what we know. Science is the pursuit of knowledge through rigorously defined methods and practices. Scientists have historically been convicted of heresy and burned at the stake, but in the 20th century, scientists achieved significant prominence in many societies. Unfortunately, their rise in stature is not always well-received, especially when scientific findings are viewed as an economic or ideological threat. In the 1980s and 1990s, scientists weren’t physically tortured, but their practices were regularly hijacked, often under the guise of “sound science.”

The most blatant abuse of scientific process unfolded in the fields of climate science and public health, as the oil industry and tobacco industry worked to seed doubt about scientific consensus on climate change and cancer. More than anything, these efforts perverted scientific uncertainty to enable analysis paralysis and make action impossible. In the 1990s, a group of scholars came together to make sense of this phenomenon. They coined a term “agnotology” to describe the study of ignorance. Ignorance isn’t simply what we don’t yet know; it also refers to knowledge that has been lost and knowledge that has been purposefully polluted.

Uncertainty is central to the scientific process. But in a public policy context, uncertainty is seen as toxic and dangerous. The politicization of uncertainty to undermine scientific consensus during this period is part of why those seeking to ensure the legitimacy of federal statistics often default to rejecting any information that might undermine confidence in the data. Today, policy-oriented people don’t want to talk about uncertainty because, for 20 years, they watched uncertainty be used to undermine scientific knowledge and evidence-based policymaking.

Census data are the product of scientific work. They are also infrastructural in our society, core to countless policies and practices. Lives depend on that data. Economies depend on that data. Public health depends on that data. Those using census data want to know that they can trust the data, that they can rely on the data in their calculations. Scientists who work on those data are obsessed with quality, but they have never been able to produce perfect data. Yet, the more that data are politicized, the more perfect they are expected to be. And the more perfect they are expected to be, the more those invested in the legitimacy of the data are expected to suppress discussion of uncertainty, noise, and error. In the process, an illusion has been born.

Drawing on the work of other scholars, I can’t help but think of this illusion as one type of statistical imaginary. In my mind, a statistical imaginary forms when people collectively construct a vision of what data are and what they could be. For example, when the Constitutional conveners imagined conducting a census to anchor a democracy, they were creating a statistical imaginary. Corporations also produce statistical imaginaries. For example, when companies create advertisements talking about all of the benefits of “big data” and AI, they’re producing a vision.

Statistical imaginaries don’t have to be far-fetched fantasies. They don’t even have to be illusions; they can be deeply grounded in practice, rooted in pragmatic goals, and realized through technical systems. But they can also come unmoored from practice when the illusion of what statistics should be is more appealing than the reality of what it is. Machine learning is a powerful tool, but the fantasy that machine learning can solve all societal problems is disconnected from reality.

The key to responsible data science is to keep statistical imaginaries in check. Many famous people have spoken about the dangers of lying through statistics, contorting statistics to say things inappropriately. There is also a danger of producing a statistical imaginary that can’t be realized. Responsible data science requires us to ground these conversations. Yes, data must be sound. But so too must be the technical, cultural, and political logics that surround the analysis and use of data.

All data are made, not found. The notion that data can be the product of an apolitical act of counting feels warm and fuzzy. But it is an illusion. And that illusion obfuscates how data categories are politically contested, how choices in collection and processing require human decisions. But the biggest problem with this illusion is that it encourages those involved in data work to ignore the limitations of data in order to appease an ideal of objective facts. Data cannot be treated as givens. Their imperfections and context must be lovingly engaged with.

Engaging with uncertainty is risky business. People are afraid to engage with uncertainty. They don’t know how to engage with uncertainty. And they worry about the politicization of uncertainty. But we’re hitting a tipping point. By not engaging with uncertainty, statistical imaginaries are increasingly disconnected from statistical practice, which is increasingly undermining statistical practice. And that threatens the ability to do statistical work in the first place. If we want data to matter, the science community must help push past the politicization of data and uncertainty to create a statistical imaginary that can engage the limitations of data.

As technical researchers and scientists from around the world, y’all have a role to play. We all owe it to our respective communities to ensure a more responsible data future. This means that we must resist the fantasies that surround data in our present world and help ground the data work that is emerging. Many of you are already committed to producing metadata about datasets to render visible the features of that data. This should be standard practice. But take it a step further… How are you working to understand how data is being used? And what are you doing to assure that data is being used responsibly?

The politicization of climate data and cancer data 20 years ago should’ve been a warning. The politicization of data is now all around us. It threatens the legitimacy of democracy’s data infrastructure. It threatens the ability to understand public health crises. It threatens the ability for individuals, businesses, and governments to make informed decisions.

Many of you here today are toolbuilders who help people work with data. Rather than presuming that those using your tools are clear-eyed about their data, how can you build features and methods that ensure people know the limits of their data and work with them responsibly? Your tools are not neutral. Neither is the data that your tools help analyze. How can you build tools that invite responsible data use and make visible when data is being manipulated? How can you help build tools for responsible governance?

Some of you here today are critical scholars, watching this all unfold. We have all seen technologies be used to enact abuses and reify structural inequities. But let’s also be careful. In some contexts, our critiques are getting twisted around to undermine data infrastructures that uphold democracy and civil rights. Context matters. Yes, we need to critically interrogate how technology upholds systems of power. But we also need to be cognizant of who benefits by seeding doubt and undermining science and statistics.

Census data are a canary in the coal mine. The controversies surrounding the 2020 census are not going to go away in short order. The statistical imaginary of precise, perfect, and neutral data has been ruptured. There is no way to put the proverbial genie back in the bottle. Nothing good will come from attempting to find a new way to ignore uncertainty, noise, and error. The answer to responsible data use is not to repair an illusion. It’s to constructively envision and project a new statistical imaginary with eyes wide open. And this means that all who care about the future of data need to help ground our statistical imaginary in practice, in tools, and in knowledge. Responsible data science isn’t just about what you do, it’s about what you ensure all who work with data do.

This talk draws from and responds to a bunch of different scholarly texts. Here are some to start with: