The thesis we are about to discuss examines a methodological debate around data that began in the social sciences during the late 2000s, boomed during the following decade, and by the early 2020s had already begun to fade away – or rather, to reorient itself toward newer and currently hot technological developments, most notably generative AI.
The debate I’m talking about, of course, is the one around big data in social science. The term “big data” itself was popularized in the early 2000s in the information technology industry, where it referred to datasets so large and complex that they challenged existing tools for storage and analysis. In the context of the social sciences, however, the notion typically refers to something more specific: the novel large-scale digital datasets that became available for research around the year 2010 from various sources outside academic social science – such as social media platforms, online sites and services, digital administrative systems, and so on.
For many social scientists, access to these new data seemed to radically alter the opportunities available for research. New digital sources were variously argued to provide more comprehensive data than before, data with more detail, data on phenomena that had previously been difficult to study, as well as data that occur without the researcher’s intervention. In one sense, there is nothing very new about social scientists using large-scale datasets originating in non-academic sources. For instance, the practice of using register data from governmental sources has a centuries-long history. The difference is that the digital datasets that emerged around 2010 quite suddenly provided access to very large volumes of new information in a wide range of fields where laboriously collected small-scale datasets had previously been the norm. On top of this, because accessing and using the new digital data at scale required computational techniques from fields outside the social sciences, big data became associated with the possibility of methodological reform. Sometimes these claims took far-reaching forms. For instance, in 2013 the prominent computational social scientist Duncan Watts[1] suggested that the incorporation of digital data sources into social research could in the long run transform social science into “a computational discipline much as biology did in the 1990s”.
Unsurprisingly, many social scientists were highly critical of such claims, and by extension of the new digital resources that they concerned. Critics argued that the enthusiasm around the new data was inspired by a business-driven rhetoric that gave primacy to computational technologies at the expense of earlier approaches in social research. The concern was that uncritically embracing big data could, as the sociologist Alexander Halavais[2] put it, ultimately lead to non-computational social scientists being “properly relegated to the dustbin of history”. A related worry concerned the origins of the data in platforms – such as social media – which researchers could neither control nor easily understand. This so-called “found” nature of most of the new digital datasets cast doubt on their viability as research resources, as well as on the position of social scientists as their users – as people who could claim to have the right kind of expertise and skills for interpreting them, that is to say, as legitimate interpreters of found digital data. Many contended that most of the computational research done around the new data sources at the time had developed in disconnect from the aims, traditions, and concerns of earlier social science. And so big data became a contest over disciplinary authority: who has the expertise to say how these new datasets and methods should be used for social research, and how can the credibility of interpretations and claims drawn from them be established? Calls by both social and computational scientists to develop approaches to the new data that would be credible from a specifically social scientific perspective gave these questions increasing weight.
My dissertation sets off against this background of debate to investigate how this might happen. How do social scientists negotiate the credibility of digital big data as research resources, and what kinds of conceptions of research do they simultaneously put forward? This is the overarching research question that the dissertation addresses.
In doing so, I had two main conceptual starting points. First, the dissertation draws on a long line of work in the field of Science and Technology Studies, and more recently the critical data studies literature, to view data as objects that always need to be imagined as such – as useful for certain kinds of analytical purposes – in order to be used as data in the first place. On this view, the same data can support a host of differing conceptions, which lead to different approaches to using them. To give an example, in interpretative research, interviews or other textual materials are usually approached by coding or closely reading the texts, paying attention to the nuanced ways in which meanings or themes are constructed in them. By contrast, computational analyses of the same materials often start by counting the occurrences of different words, turning the texts into numeric matrices that can then be subjected to statistical modeling. These approaches diverge markedly in their understandings of what information in the texts is relevant and what it means to analyze textual materials. The point is that the same goes for big data in social science: different conceptions of big data and computational methods imply different, and sometimes conflicting, ideas of research. My interest was to investigate how such ideas figure in, are drawn upon, and sometimes conflict in actual attempts to deploy new digital materials in and around social science.
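To make the word-counting contrast concrete, here is a minimal sketch of that first computational step – turning a few texts into a document-term matrix of word counts, the usual input to statistical modeling. The example texts are invented for illustration; a real analysis would involve more careful tokenization and typically a dedicated library.

```python
# A minimal bag-of-words sketch: texts become rows of word counts.
# Example documents are invented for illustration only.
from collections import Counter

docs = [
    "big data changed social research",
    "social research uses interview data",
    "interview texts carry nuanced meanings",
]

# Split each document into lowercase words.
tokenized = [doc.lower().split() for doc in docs]

# The vocabulary: every distinct word, in a fixed order.
vocab = sorted({word for doc in tokenized for word in doc})

# One row of counts per document -> a numeric matrix.
matrix = [[Counter(doc)[word] for word in vocab] for doc in tokenized]

for row in matrix:
    print(row)
```

Note how the representation discards word order and context entirely – exactly the kind of divergence from close reading that the paragraph above describes.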
My second conceptual starting point, credibility, is also the conceptual core of the dissertation. By credibility I mean the ability or “capacity” of researchers to convince others that their uses of the new data and methods can be considered acceptable ways of producing knowledge. The notion resonates well with a situation in which researchers grapple with data that originate in weird, foreign systems and processes – systems that were not designed for academic research and provide only limited possibilities for checking and controlling the data. What kinds of approaches might be judged credible in such a situation, and how are they formulated and put together? In the dissertation I conceptualize these attempts as a persuasive process of credibility-building, in which researchers seek to legitimate their approaches to data use by aligning them with the understandings and approaches of other, more established or authoritative areas of research. These could include social scientific fields such as sociology, political science, or statistics, but also newer areas previously not connected to the social sciences – such as data science, machine learning, or computer science more broadly – which might provide criteria relevant for constructing credible approaches in the eyes of some audiences. The main point is that by aligning themselves with certain authoritative traditions or fields of research, the proponents of non-established approaches can construct for themselves a position in the debate from which knowledge claims can credibly be made.
The dissertation comprises six original research articles that examine this process of credibility-building from several angles and across several contexts: academic sociology; computational social science and its development over the past two decades; commercial social media analytics, where analysts try to build approaches to data use that are credible and interesting from the perspective of their clients; and finally interpretative text analysis on the borderline of social and humanistic research. Four of the articles are empirical studies that draw on a variety of materials and methods. The other two, the studies of interpretative text analysis, are theoretical discussions that investigate ideas of objectivity in text analysis that uses machine learning. Taken together, the articles provide a view of big data in social science as what I call a conceptual phenomenon. Here, the emphasis is on how researchers understand, approach, conceptualize, and argue for or against big data and computational methods, over and above the actual digital inscriptions and technical procedures that make up the data – sometimes called the “material aspect” of big data. Doing so, I argue, helps shed light on some of the central issues in the critical debates around social science big data, including 1) the rhetorical workings of big data and the claims about the challenge they pose for the status of social scientists; 2) the role that data foundness plays in this picture – how it is that found data become problematic; and 3) the ways in which conceptions of new data and methods figure in attempts to use these resources within actual social scientific research.
In this regard, based on its sub-studies, the dissertation makes three main claims.
First, it argues that social scientists can and often do command quite high authority in dictating credible approaches to found digital data use within their respective fields; however, this authority might not easily extend elsewhere.
Second, this means that establishing widely accepted approaches to data use becomes an organizational challenge of reconciling differing and possibly conflicting conceptions. Importantly, this does not reduce to merely handling technical issues around problematic data, but instead can concern deeper differences in epistemic ideas of acceptable research.
Third, and finally, the diversity of, and distance between, the contexts where the data are produced and used make this organizational challenge persistent. This, I argue, is one of the primary ways in which data foundness complicates attempts to establish credible lines of computational social research. The diversity of fields involved in debating new data and methods within social science alone is high, and extending the debates outside social scientific contexts, as has happened with social science big data, only makes the issue more difficult, as none of these fields can claim prima facie authority to dictate to the others how the found data should be used.
When I began working on this dissertation in 2017, big data was still a hot notion, and new methods such as unsupervised machine learning for text analysis were considered state-of-the-art in emerging computational social research. Now, several years later, I often hear these techniques described as “old school” or “outdated”. The notion of big data is rarely used anymore – and when it is, more often ironically than not. This is perhaps not surprising, given everything that has happened over the past few years around technologies such as artificial intelligence. Likely, the efforts of many social media platforms to close down research access to their data have also contributed to the situation. In any case, my intention is not to claim that we should go back to talking in terms of big data – on the contrary, I think the notion was in many ways confusing.
Yet I do believe there is a danger in shifting focus too frequently, in pursuit of hyped new methods, while the work of establishing practices and conceptions for using the previous generation of techniques is still in flux. The current discussions around generative artificial intelligence in many ways resemble the earlier big data debates in social research – from the commercial origins of both, to the difficulties of understanding and validating where their results come from, to concerns that they might end up supporting certain kinds of research over others. For instance, what happens when models become simultaneously more opaque and more difficult to understand, yet easier to use and access for researchers with widely different methodological backgrounds and technical skills? Glossing over the earlier big data debates when thinking about such questions would make comparisons difficult, risk re-inventing the wheel at best, and reproduce very recent difficulties at worst.
Finally, for the record, in an earlier version of this talk I had a series of data-related jokes in the beginning, which I hesitated to keep there. Ultimately I asked ChatGPT to tell me whether the jokes were any good. It said they were hilarious – but only after I’d told it they were supposed to be funny. So there you have it. No humour. Just a lot of data.
The dissertation is available at: https://helda.helsinki.fi/handle/10138/603223
- Watts, D. (2013). Computational social science: Exciting progress and future directions. The Bridge, 43(4), 5–10. https://www.nae.edu/106118/Computational-Social-Science-Exciting-Progress-and-Future-Directions
- Halavais, A. (2015). Bigger sociological imaginations: Framing big social data theory and methods. Information, Communication & Society, 18(5), 583–594. https://doi.org/10.1080/1369118X.2015.1008543

