In the Pursuit of Knowledge, There Be Dragons

Flickr: Maurits Verbiest

This is a talk script from my keynote at IEEE Vis 2021.

Good afternoon! I’m honored to be here today, speaking with you, in no small part because you have given me the opportunity to try to connect disparate parts of my career together into a coherent message.

I started my career in computer graphics, relishing how mathematical equations could bring images to life. But in 1999, I was enamored by the work of Judith Donath at the MIT Media Lab who was trying to visualize Usenet, an ancient form of social media that had kept me whole in my youth. I decided to enroll in grad school, where I began visualizing and animating networks of various kinds. The practice of visualizing information led me to be curious about the underlying data, which led me to retrain as an ethnographer who now studies how data and society intersect.

I’ve spent the last 3 years as an ethnographer studying the 2020 US census, trying to understand what makes data legitimate.

My goal today is to invite you to see data and information differently by weaving together some of the lessons I’ve learned traipsing through different disciplines and ways of seeing. Hopefully this will be fun. Let’s dive in…

Reality is Perception

The year was 1997 and I was working at Macromedia to build tools for animators. I flew down to Burbank to meet with one of our key customers: Disney. I watched in awe as professional artists used our tool to construct animations for the web. One pushed back at the product, pointing out that our tool hadn’t given him enough controls to contort the images the way he needed to in order to make things look real. I took notes as he told me that reality is really about perception.

I couldn’t help but think of what a friend of mine at Pixar had taught me. He had started his career doing physically-based modeling, but when he was working on Toy Story, he quickly realized that a technical model rooted in physics didn’t look real. Artists needed controls to make the images be unrealistic so that they would appear realistic.

Animators have long known that communicating information to an audience requires exaggeration. When an animated ball bounces on a screen, it does not retain its form. It is stretched and squashed just to look right. This is what Disney called the Illusion of Life.

The work of visualization — like the work of animation — is fundamentally about communication. Even if your data are nice and neat, the choices you make in producing a visualization of that data shape how those data will be perceived. You have the power to shape perception, whether you want to or not. There is no neutral visualization, just as there is no neutral data. Thus, in building your tools, you must account for your interlocutors. What are you trying to convey to them? When do you need to stretch the ball so that the viewer sees the information as intended?

Contextualizing a Speech Act

In the United States, journalism is given a special status under the belief that an informed citizenry is crucial to democracy. The practice of journalism is messy and fraught. Journalists must make decisions about what to prioritize and how to communicate that information. Journalists don’t like thinking about these choices as an exertion of power. They like to imagine themselves as neutral reporters, simply amplifying information that the public has the right to know in a clear and accessible format.

Chinese journalism functions differently than US journalism, in no small part because the relationship between the state and journalism is configured differently in China than in the US. But Angela Xiao Wu points out a less noticeable difference. She notes that Chinese journalists have a different theoretical orientation than Western journalists. In a country where there are more engineers than lawyers, the root theory of communication stems not from Marx and Weber and Durkheim, but from Claude Shannon.

Claude Shannon’s theory of information is crucial for computer science. It is the backbone of cryptography and networking. But Shannon’s theories also make a profound statement about what it means to communicate. Shannon is less concerned with what the communicator is trying to say than with what the recipient is able to hear. Packet loss is inevitable. As such, the communicator must arrange the information to ensure that, even with noise in the system, the recipient is able to get the intended message.

These logics, focusing solely on the speech act or focusing exclusively on what someone hears, form a polarity. Two ends of a spectrum. In practice, communication is fundamentally about the relationship between the two. Tools shape this relationship. The media of journalism (print, radio, TV, Facebook) influence the communicative act. Context matters. People understand a story differently based on what they’re exposed to before or after, what’s above or below the story in a news feed. Increasingly, those producing the news have no control over the context in which it will be consumed.

Visualization is a medium to convey information. But the power of a visualization is not simply about its tool. It’s about the context. Most likely, you don’t control the context. You must design your tool with an eye for how it will be taken out of context.

The Voice of Data

Data cannot speak for themselves. Data are never neutral. Data have biases and limitations, vulnerabilities and uncertainty. And when data are put into a position of power, data are often twisted and contorted in countless ways. As the economist Ronald Coase once said, “if you torture the data long enough, it will confess.”

Learning how to truly see data is one of the hardest parts of doing data science. The first step is recognizing that data cannot be taken for granted. Data must be coaxed into showing their weaknesses. The weaknesses are not always obvious. As a tool, visualization can help reveal data’s weaknesses, or obscure them.

Years ago, I taught an introductory data science course. Students were expected to know Python and to prepare their environment for the first class. To prove that they had done so, I asked students to load one year of NYC’s Stop and Frisk data and tell me the average age of someone who had been stopped. The quickest student to the draw blurted out 27. One by one, the other students echoed the same number. Is it right? I asked. They looked at me blankly. What could I possibly mean? Clearly it was right, everyone had the same answer. I asked what the result meant. They jumped to social assumptions, highlighting how they would’ve thought it would be younger, but maybe it had to do with the large number of homeless people in NYC. They didn’t turn to the data because they thought they knew the answer and just needed to explain it. I then asked them to run a distribution. Eyes lit up. So many people in the data are 0. Or 99. I then asked them to compare the age variable to the date-of-birth variable. For so many of the records, these two didn’t match. These students were tricked into learning the first rule of data science: data are messy. The key to grappling with data is to understand the weaknesses of your data before trying to ask questions.
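The classroom exercise above can be sketched in a few lines of Python. This is a minimal illustration, not the original assignment: the records are synthetic and invented for the example, the field names (`age`, `dob`, `stop_date`) are hypothetical, and treating 0 and 99 as placeholder codes is an assumption that would need to be checked against the dataset's documentation.

```python
from collections import Counter
from datetime import date
from statistics import mean

# Synthetic, made-up records for illustration only. NOT real Stop and Frisk data.
# The field names are hypothetical.
records = [
    {"age": 19, "dob": date(2000, 3, 1), "stop_date": date(2019, 6, 1)},
    {"age": 22, "dob": date(1997, 1, 15), "stop_date": date(2019, 2, 1)},
    {"age": 0, "dob": date(1990, 5, 5), "stop_date": date(2019, 7, 4)},
    {"age": 99, "dob": date(1985, 8, 9), "stop_date": date(2019, 9, 9)},
    {"age": 31, "dob": date(1970, 1, 1), "stop_date": date(2019, 4, 2)},
    {"age": 99, "dob": date(1900, 12, 31), "stop_date": date(2019, 3, 3)},
]

# Assumption: these values act as "unknown" placeholders rather than real ages.
SUSPECT_AGES = {0, 99}


def age_on(dob, day):
    """Age in whole years on a given day, derived from a date of birth."""
    years = day.year - dob.year
    if (day.month, day.day) < (dob.month, dob.day):
        years -= 1  # birthday has not happened yet this year
    return years


ages = [r["age"] for r in records]

# Step 1: the number everyone blurts out.
naive_mean = mean(ages)

# Step 2: look at the distribution before trusting the mean.
distribution = Counter(ages)

# Step 3: recompute after setting aside the suspect placeholder values.
filtered_mean = mean(a for a in ages if a not in SUSPECT_AGES)

# Step 4: cross-check the age field against the date-of-birth field.
mismatches = [r for r in records if age_on(r["dob"], r["stop_date"]) != r["age"]]

print(f"naive mean: {naive_mean}, filtered mean: {filtered_mean}")
print(f"age counts: {dict(distribution)}")
print(f"records where age and date of birth disagree: {len(mismatches)} of {len(records)}")
```

On this made-up sample the naive mean is 45 while the mean of the plausible ages is 24, and four of the six records carry an age that disagrees with the date of birth. Neither number is "the answer"; the point is that the summary is only as trustworthy as the fields behind it.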

When you build your tools, what assumptions do you make about your data? How do you help those who are looking to make sense of the data see the limitations in the data? How do you coax data into showing its weaknesses? How do you encourage data users to see uncertainty? These are choices.

The Politics of Precision

331,449,281. That is the number that the Census Bureau announced after it finished its data processing. A precise number to represent the population of the United States on April 1, 2020. A number that is politically necessary but scientifically confounding. Anyone with two senses knows that a census is bound to miss people, bound to count some people twice. There is simply no way that, during a pandemic, the US government managed to count a number that precise. Moreover, if you know anything about the processing that goes into a census, you’d know that many of the procedures that are designed to clean up the data and fill in the gaps introduce all sorts of uncertainty along the way. So why then does the Census Bureau announce a number that precise?

The Constitutional framers of the United States could not have believed that the US marshals, the law enforcement branch initially tasked with conducting a census, could successfully count every resident of the new country. But they too came back with a precise number in 1790: 3,929,214. Hogwash, said George Washington. He was convinced that the nascent country was bigger than that! Moreover, he was offended by the first reapportionment of Congress, which he thought was rubbish because it would create unevenness in representation. And so, the first president of the United States issued his first veto, vetoing the reapportionment plan of Congress. Thomas Jefferson intervened with a solution: math. His plan was to reduce “the apportionment always to an arithmetical operation, about which no two men can ever possibly differ.”

130 years later, two men did differ about how a population could be divided. There was a day in 1921 when an economist and a statistician walked into a room… Only the economist had published an op-ed in the New York Times and the statistician was trying to be a peacemaker. And the room was full of politicians who didn’t care about math. But a few of them saw an opportunity. They encouraged the bickering scientists to go on bickering and told their fellow politicians that they should not try to reapportion Congress until the scientists reached a resolution. Nine years later, Congress had still failed to reapportion itself, allowing conservatives to control an unfair share of the House and institute a range of anti-immigration policies. Furious, another politician brokered a compromise. The key to fights over math was to formalize an algorithm once and for all. And such is the reason that the United States only has 435 representatives in the House. And the reason that a precise number is required to be inserted into the algorithm for automatic apportionment.
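To see why an apportionment algorithm demands exact counts, here is a minimal sketch of one. The talk does not name the algorithm; this example assumes the method of equal proportions (also known as Huntington-Hill), which is the method used for the House today. The state names and populations are hypothetical, and the function name `apportion` is invented for the illustration.

```python
import heapq
from math import sqrt


def apportion(populations, seats):
    """Method of equal proportions (Huntington-Hill), as a sketch.

    Every state starts with one seat. Each remaining seat goes to the state
    with the highest priority value pop / sqrt(n * (n + 1)), where n is the
    number of seats that state currently holds.
    """
    if seats < len(populations):
        raise ValueError("need at least one seat per state")

    allocation = {state: 1 for state in populations}

    # heapq is a min-heap, so priorities are stored negated.
    heap = [(-pop / sqrt(1 * 2), state) for state, pop in populations.items()]
    heapq.heapify(heap)

    for _ in range(seats - len(populations)):
        _, state = heapq.heappop(heap)
        allocation[state] += 1
        n = allocation[state]
        heapq.heappush(heap, (-populations[state] / sqrt(n * (n + 1)), state))

    return allocation


# Hypothetical populations, chosen only to keep the arithmetic easy to follow.
example = {"A": 6000, "B": 3000, "C": 1000}
print(apportion(example, seats=10))  # {'A': 6, 'B': 3, 'C': 1}
```

The procedure has no slot for a margin of error. It takes one number per state and ranks them, which is why a single precise count has to be produced no matter how uncertain the underlying enumeration is.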

But this fetishization of precision has consequences. If the country had followed Washington’s guidance, the size of the House of Representatives would be over 11,000 members. If statisticians could be honest about the limitations of the data, Congress would be forced to grapple with uncertainty when they defined laws that depend on census data. But Congress doesn’t do math. And so we are regularly forced to contend with the illusion of precision.
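The 11,000 figure can be checked with quick arithmetic. This sketch assumes that "Washington's guidance" refers to the constitutional ratio of one representative for every 30,000 people; that ratio is an assumption of the example, not something stated in the talk. The population is the 2020 figure quoted earlier.

```python
POPULATION_2020 = 331_449_281  # the Census Bureau figure quoted earlier in the talk
PEOPLE_PER_SEAT = 30_000       # assumption: one representative per 30,000 people
CURRENT_SEATS = 435

# Seats the House would need if the ratio had stayed fixed.
seats_at_fixed_ratio = POPULATION_2020 // PEOPLE_PER_SEAT

# People represented by each seat under the fixed size of 435.
people_per_seat_today = POPULATION_2020 / CURRENT_SEATS

print(seats_at_fixed_ratio)           # 11048
print(round(people_per_seat_today))   # 761952
```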

And that illusion of precision trickles down into countless visualizations about the population of the United States. And since the population of the United States is the base number for countless other calculations — GDP, vaccine distribution, etc. — the illusion of precision gets baked into countless datasets and countless visualizations. And if you want to get into a heated debate with census advocates, try noting that all data have noise. Over the next year, you can expect a robust fight over differential privacy as a technique used to protect statistical confidentiality in the census precisely because that technique requires acknowledging and working with statistical noise. This is heresy when the illusion of precision is so dear.
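For readers who have not met differential privacy, the core move is to add calibrated random noise to a statistic before releasing it. The sketch below shows the textbook Laplace mechanism for a single count. It is an illustration of the general idea only, not the Census Bureau's actual system; the function name `noisy_count` and the values of the true count and `epsilon` are hypothetical.

```python
import random


def noisy_count(true_count, epsilon, rng):
    """Textbook Laplace mechanism for a counting query (sensitivity 1).

    Smaller epsilon means stronger privacy and more noise.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    scale = 1.0 / epsilon
    # The difference of two independent exponential draws with mean `scale`
    # is a Laplace draw centered on zero with that scale.
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return true_count + noise


rng = random.Random(0)  # seeded only so the sketch is repeatable
TRUE_COUNT = 1000       # hypothetical count for one small geography
releases = [noisy_count(TRUE_COUNT, epsilon=0.5, rng=rng) for _ in range(5)]
print([round(r, 1) for r in releases])
```

Each released value lands near the true count but not on it, and the amount of noise is a published parameter rather than a hidden one. That openness about noise is exactly what collides with the expectation of a single precise number.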

Biases All the Way Down

Data aren’t just noisy, they’re also biased. After all, data are socially constructed. Work with me here… I know that this statement is heretical in some contexts and a given in others. But take a moment to think about the categories used within data. In population data, we categorize people by race, gender, and geography. The boundaries of geography are shaped by politics and by choices that geographers make when looking at the natural environment. A gender binary is formed socially and loosely tethered to differences in biological features that are not as binary as people seem to think. And then there’s race. Even when asking people to self-identify their race, those choices are shaped by cultural forces in countless ways.

But it is not just population data that are categorically contentious. If you’re working with scientific data, you often have to construct categories to segment data. If you work with networks, you have to identify the best min-cuts to use to slice the data. Machine learning techniques are fundamentally about segmenting the data. These are guided by mathematical formulas, but the choice to create segments is socially determined.

Once you create those categories, you then have to deal with all of the data that don’t fit nicely into the categories. And you also have to deal with the data that got twisted into the categories for political purposes. And if those categories have social, political, or economic power, you have to deal with how some categories will be given preferential treatment.

Society is filled with inequities. People harbor biases and prejudices. For all that people imagine computing as a great disrupter, the irony is that many of our computing practices are more obsessed with reifying the categories that humans create than disrupting them. What is machine learning? In effect, it’s about identifying the categories that humans have socially constructed and computationally identifying them in systems that typically amplify them. This is what makes machine learning and artificial intelligence so politically controversial.

Consider research on word embeddings. Given the widespread use of public texts to train ML and NLP models, researchers wanted to understand how the biases of earlier texts affect the models. Sure enough, ML systems quickly learn that nurses are female and doctors are male. These systems didn’t learn an intrinsic fact, but a socially constructed one. When a model with this bias is then put into a system that uses its findings, it is then in the position to reinforce those biases. Left unchecked, auto-complete in your email would happily encourage you to maintain the gendered norms found by machine learning models.

When you visualize data with biases in it, are you designing your tool to reveal those biases or to reify them?

Epistemic Warfare

The word statistics was created in 1749 to reference a practice that originally went by a different name: political arithmetic. Statistics referred to the quantitative work of state knowledge production. The product of statistics was to be data. But not data as we currently understand it, data in its original Latin sense, where data referred to information that was given. The work of government statistics was to produce and give data. To not invite questions about that data, to not doubt that data. Official statistics were to be neutral gifts of the state.

Data are never neutral. They are biased. They are rife with uncertainty and limitations and all sorts of other imperfections. But for data to be legitimate in the eyes of non-technical actors, data must be performed as precise and objective and neutral. This creates a conundrum for anyone whose practice relies on communicating data. When high-powered people want to rely on data as truth, they don’t want to be faced with confidence intervals or error bars. They want to be told that the data are reliable, by which they mean accurate, by which they mean a perfect representation of whatever they wanted to measure. Ignorance is bliss. It’s also political.

Epistemology is the study of knowledge. How we know what we know. We use tools to help us know things, but those tools also reflect our values, worldview, and ways of knowing. Those who take a mathematical approach to sense-making rely on proofs to understand the world. Scientists may use instruments. Other people may leverage religious texts or memories of past experiences to anchor how they see a situation. The epistemic framework of most politicians is flexible.

Just as epistemology is the study of knowledge, agnotology is the study of ignorance. Scientists typically think of ignorance as constituting that which we don’t yet know. But agnotology scholars argue that there are two other important kinds of ignorance: 1) that which is forgotten; and 2) that which is purposefully polluted. Forgotten knowledge can include indigenous knowledge, erased through genocide, but it can also include that person in your office who knew The Thing and now no one knows The Thing. But polluted knowledge is a whole other beast. Polluted knowledge is structured to seed doubt when consensus emerges, to fragment people and ideas as a form of control.

Scholars of ignorance coined the term agnotology in response to late 20th century efforts to undermine consensus around climate change and the relationship between tobacco and cancer. The goal of industry-backed actors and their political collaborators was to seed doubt so as to delay policy or regulatory action. To do so, they published articles in journals and leveraged visualization to show uncertainty. Not uncertainty in a data-centered sense, but uncertainty in a political sense. To delegitimize the data through uncertainty. This attack on science for political and economic purposes had serious ramifications, making policy action impossible while also undermining the scientific process.

There is a lot of literature in the visualization community that talks about trying to convey uncertainty. There is even a decent amount that highlights politicized uses of data. After all, Edward Tufte understood this dilemma in the 1970s when he was trying to teach journalists about statistics. But the adversarial nature of information has evolved significantly since then. Data aren’t just politicized — they are actively perverted and used to undermine science-oriented work.

When you build a visualization, you must account for how your work could be twisted to enable ignorance, not just knowledge. To create misperception. Misinformation and disinformation aren’t simply attacks on political speech; they are epistemic attacks designed to undermine all forms of evidence. So think like a hacker and consider how to secure your visualization work to prevent it from becoming a tool for misinformation.

To See With Clear Eyes

I fell in love with visualization as a tool when I realized that it could help me see complex information in a better light. I wanted to understand the networks formed as traces of people and practices. Networks have high dimensionality to them. Paper is two-dimensional. If you are skilled like an artist, you may be able to convey three or perhaps four dimensions. But the beauty of interactive visualization is that you can keep turning a network over and over again, seeing the graph from a different vantage point each time.

The first interactive visualization I created as an undergraduate student leveraged spring models to lay out the graph, simply because that’s what my advisor told me to do. Yet, in animating those visualizations, I found myself enamored with how the spring system forced me to remember that nothing about the structure I had created was static. Everything was relational.

Visualizations are powerful tools. They allow us to explore data, to make sense of what our data might be hiding about themselves. They allow us to communicate data, revealing aspects to data that are hard to grasp. They can also be used to assert authority, in both productive and dangerous ways.

One of my mentors told me that companies are in their most precarious state when their internal self-awareness is maximally at odds with the external perception of the corporation. You can think of this as Facebook’s current problem. But it was also a problem for Microsoft when the company was navigating its anti-trust case. He said that the work of good corporate communications is to narrow that gap, to ensure that the internal and external sense of reality are closely aligned. That’s not as simple as it seems. Those who build companies, those who work at companies… their identities are often aligned with the company. They see the company in a positive light, even when it’s under attack. Like horses who are afraid of the dark, they put blinders on to prevent themselves from being demoralized and frustrated.

On election night in 2016, the New York Times presented an absurd visualization of the probability that each candidate would win. That damn needle was baked with uncertainty. You could find the details of the margins of error in the fine print. But all that the public saw, all that the editors of the New York Times saw, all that the journalists saw was a binary: Hillary Clinton was going to win. After the election, pollsters everywhere justified their work by noting that a Trump win was within the margin of error! But that’s not at all what nearly everyone obsessing over that election had seen in the visualization. The distance between what the creator of the needle saw and what those who used the darn thing to understand the election saw was profound.

When you build a visualization tool, you will want to see it for all that it can be, for all that it can do. That is only natural. It takes significant effort to see the complexity of your own work. But doing so is important. Visualizations have power. They can convey information and amplify certain interpretations. This means that they are political artifacts, regardless of what you may wish them to be. My ask of you today is simple: pay attention to the limits and biases of your data and the ramifications of your choices. Put another way, there be dragons everywhere. Design with care and intention, humility and flexibility.

Thank you!

researcher of technology & society | Microsoft Research, Data & Society, NYU | zephoria@zephoria.org