Smarter Social Media Analytics at Underhood

We, the researchers of the SSMA project, spent the week of 3-7 April at the startup company Hupparihörhö. The purpose of the fieldwork week was to get to know Underhood, a service developed by Hupparihörhö that measures the reputation of companies based on social media data.

Underhood has been in the news in recent months after building, together with Aamulehti, a reputation meter that predicts the municipal election result in Tampere. In the SSMA project we are interested in studying how social media data can be used to measure and predict societal phenomena. A week at Underhood in the run-up to the municipal elections offered an excellent opportunity to follow the prediction of a concrete case from social media data, while also learning about the Underhood team's views on the possibilities of data analytics.

Early in the week we familiarised ourselves with the reputation score that Underhood calculates for companies from social media data. The score is based on three separate metrics. First, Underhood tracks a company's visibility, which is calculated from the number of likes on the company's Facebook page, the number of its Twitter followers, and the buzz value provided by Facebook. Second, Underhood measures the company's dialogue with its social media audience, which is determined by the company's average number of posts and by the comments, likes, and shares those posts receive. The dialogue metric is also affected by the company's response rate to the comments it receives. Third, the reputation score is affected by the similarity of the words used by the company and by its audience, and by the tone of the audience's comments as obtained through sentiment analysis. The scores for visibility, dialogue, and similarity are each scaled to a range of 0-10. The actual reputation score is then calculated as the mean of these three scores, so it also falls on a scale of 0-10.
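The final averaging step can be illustrated with a minimal sketch. It assumes that the three sub-scores have already been scaled to 0-10; how Underhood performs that scaling is not described here, so it is left out. The function name `reputation_score` and its parameter names are hypothetical and chosen only for illustration.

```python
from statistics import mean


def reputation_score(visibility: float, dialogue: float, similarity: float) -> float:
    """Return the mean of three sub-scores.

    Assumption: each sub-score is already on the 0-10 scale. The scaling
    itself is not specified in the text and is not modelled here.
    """
    scores = (visibility, dialogue, similarity)
    for score in scores:
        if not 0.0 <= score <= 10.0:
            raise ValueError(f"sub-score {score} is outside the 0-10 range")
    # The mean of values in [0, 10] is itself in [0, 10].
    return mean(scores)


# Illustrative values only, not real Underhood data.
print(reputation_score(6.0, 9.0, 3.0))  # 6.0
```

Because the three sub-scores enter the mean with equal weight, any factor that drives more than one of them at once is in effect counted more than once in the final score.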

In examining the Underhood score and its components, the questions that came to interest us concerned the relationships between the metrics and what they actually measure: what exactly is being measured when we collect data on, for example, the number of words used in common by companies and their Facebook likers? On what grounds could we think that metrics defined from social media data are a reliable predictor of phenomena whose news coverage and discussion are not confined to social media?

The Underhood score has previously predicted correctly, for example, the selection of the semi-finalists in the British X-Factor programme, but its prediction of the winner of the competition was wrong. One explanation for this is that the prediction was thrown off by the interest shown in international social media: voting in X-Factor UK was possible only in Great Britain, whereas the Underhood score reflected the popularity of the finalists at the international level. This is an example of a case in which metrics based on social media data measure a phenomenon (popularity in international social media) that is distinct from the target of the prediction (success in the British X-Factor).

We were given access to the data underlying the Underhood score, which contained the values of the different metrics and the factors affecting them, stored daily from August 2016 onwards (data from a total of 3958 companies and politicians). During the week we studied how the variables in the data depend on one another and compared changes in the metric values with respect to, among other things, the industry and turnover of the companies. Interestingly, we found that the similarity between the language used by companies and by their social media audience correlates with the number of comments that the companies' posts receive. This would suggest that the similarity metric may capture not only the convergence of the language used but also the volume of discussion on social media.
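A check of this kind can be sketched as follows. The sketch assumes a Pearson correlation coefficient, which is only one possible way to quantify the dependence and is not necessarily the measure we used. The function name `pearson_r` and the numbers are hypothetical toy values, not Underhood data.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Toy values for illustration only: one similarity score and one
# comment count per (hypothetical) company.
similarity_scores = [2.1, 3.4, 4.0, 5.2, 6.8]
comment_counts = [12, 30, 41, 55, 90]

r = pearson_r(similarity_scores, comment_counts)
print(f"r = {r:.2f}")
```

A coefficient close to 1 in real data would point in the same direction as our observation, namely that the similarity metric partly tracks how much discussion there is, not only how alike the language is.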

The ‘division of labour’ between different metrics based on social media data would seem to be one significant factor in assessing predictions of phenomena. When assessing the reliability of a prediction, it is important to know which aspects of the phenomenon the different metrics measure. When separate metrics are used, it would be good to make sure that scores measuring different aspects of the phenomenon do not depend on the same underlying factors, such as the level of discussion activity. This makes it clearer how to assess the relative importance or weighting of the metrics when forming a prediction.

The relative weightings of Underhood's metrics have also come up in recent days in connection with the municipal election predictions given by Aamulehti's reputation meter. In the end, the meter's predictions did not match the municipal election result in Tampere. Aamulehti judged this to be due to the large number of candidates and the proportional electoral system, which made the figures given by the reputation meter harder to interpret. According to Underhood, by contrast, the meter would have given a more accurate result if the number of Facebook likers had been weighted more heavily than it currently is.

The central question in this case, too, is: under what conditions can we regard popularity on social media as a reliable predictor of an election result? The predictions that Underhood's reputation meter gives for different phenomena, whether successful or not, thus produce useful material with which this problem can be studied.

How to study Big Data epistemology in the social sciences?

In recent years there has been discussion about whether the rise of Big Data (understood as a collection of methods and practices involved in the analysis of voluminous and rapidly accumulating data with varying structure) calls for a new kind of epistemological understanding of science (e.g. Kitchin 2014; Frické 2015; Floridi 2012; Hey et al. 2009). For instance, Rob Kitchin proclaims that

There is little doubt that the development of Big Data and new data analytics offers the possibility of reframing the epistemology of science, social science and humanities, and such a reframing is already actively taking place across disciplines. (Kitchin 2014, 10.)

This epistemological reframing is due to the idea that Big Data enable a novel form of inquiry called data-driven science, which seeks to generate scientific hypotheses by discovering patterns in vast amounts of data (Kelling et al. 2009, 613-614; Kitchin 2014, 6-7). Data-driven science contrasts with the more traditional ‘knowledge-driven science’, where the hypotheses to be examined are derived from theory rather than data (Kelling et al. 2009, 613). Thus, the argument is that Big Data can reorient the roles that data and theory play in research, and that therefore we should rethink our conception of how scientific knowledge production works.

How, then, should one go about studying Big Data epistemology? How should one assess the claim that Big Data enable a novel form of scientific inquiry that cannot be analysed using traditional epistemological concepts?

In the context of biology, Sabina Leonelli has argued convincingly that in order to critically evaluate the epistemological novelty of Big Data, ‘one needs to analyse the ways in which data are actually disseminated and used to generate knowledge’ (Leonelli 2014, 2). This is quite plausibly so in the context of the social sciences, too. As Kitchin and McArdle (2016) argue, there is no single notion of ‘Big Data’ that would apply across all contexts, and accordingly the ways in which knowledge is generated are likely to vary as well.

Thus it seems sensible that a study of Big Data epistemology in the social sciences should begin with an analysis of the different ways in which Big Data are used in different social scientific contexts. With this purpose in mind, I have collected a number of special issues, sections, and symposia on Big Data that have been published in social scientific journals in the past few years (2013-2016). A review of the different conceptions and uses of Big Data in this collection should give some basis for an assessment of the extent to which the epistemology of the social sciences needs to be reframed.

Below is a list of the collected issues along with short descriptions of their contents.

Special issues, sections, and symposia on Big Data

Political Behavior and Big Data
International Journal of Sociology 46(1), 2016.

The articles in this special issue come from political sociology, cross-national methodology, and computer science. The purpose of the issue is to identify and discuss a set of pressing methodological problems pertaining to the use of Big Data methods in these fields, including the following:

  1. Can Big Data tools be used to describe and explain political behaviour?
  2. How to create a large numerical data set from textual data?
  3. How to deal with the problem of selection in constructing event data with Big Data methods?
  4. How to harmonize large volumes of survey data from distinct sources into one integrated data set?

Big Data in Psychology
Psychological Methods 21(4), 2016.

This special issue provides 10 articles that discuss the benefits of engaging psychological research with Big Data and give instructions for the use of various common research tools. The first four articles offer guides to using Big Data methods and tools in psychological research, giving advice on the use of various APIs and web scraping tools to collect data, as well as on managing and analysing large datasets. The remaining six articles then demonstrate the use of Big Data in psychology, examining the spread of negative emotion on college campuses, models of human declarative memory, methods of theory-guided exploration of empirical data, the uses of statistical learning theory in psychology, and methods for detecting the genetic contributions to cognitive and behavioural phenomena.

Big Data and Media Management
International Journal on Media Management 18(1), 2016.

The stated goal of this special issue is to showcase media management research that employs Big Data, or analyses its use in media management (see the issue introduction, 1-2). The issue includes four research articles, which use Big Data to derive metrics for audience ratings, identify influential factors in terms of news sharing, discuss television use measurement, and examine consumers’ willingness to share personal data.

Special Issue on Big Data
Journal of Business & Economic Statistics 34(4), 2016.

This special issue includes six articles on Big Data finance and seven articles on macroeconomics, high-dimensional econometrics, high-dimensional time series and spatial data. The articles discuss a variety of issues in these fields, developing theory and methods for addressing them as well as investigating applications. (See the issue introduction, 2-3.)

Transformational Issues of Big Data and Analytics in Networked Business
MIS Quarterly 40(4), 2016.

This special issue consists of eleven research articles, which develop a variety of Big Data analysis methods relevant for information systems and business. Included are a data-driven tree-based method for assessing interventions in the presence of selection bias; network methods combining sentiment and textual analysis for developing brand advertising; methods for using fine-grained payment data to improve targeted marketing; a study of the causal effectiveness of display advertising; a model to improve resource allocation decisions; a crowd-based method for selecting parts of data as model input; methods for dealing with the scalability and privacy of data sharing; a utility-theory-based structural model for mobile app analytics; a predictive modelling method for business process event data; a topic modelling method for measuring the business proximity between firms; and a method to address various wicked problems of societal scale in information systems. (See the issue introduction, 815-817.)

Toward Computational Social Science: Big Data in Digital Environments
The ANNALS of the American Academy of Political and Social Science 659(1), 2015.

The articles in this special issue on Big Data and computational social science come from diverse disciplines, including psychology, epidemiology, political science, and communication studies. The twenty research articles included in the issue are divided into five subsections, titled ‘Perspectives on Computational Social Science’, ‘Computer Coding of Content and Sentiment’, ‘Mapping Online Clusters and Networks’, ‘Examining Social Media Influence’, and ‘Innovations in Computational Social Science’. Each of these sections contains four articles with discussions of the section theme or empirical studies using Big Data.

Big Data, Causal Inference, and Formal Theory: Contradictory Trends in Political Science?
Symposium in Political Science & Politics 48(1), 2015.

The purpose of this symposium is to discuss whether formal theorising, causal inference-making such as experimentation, and the use of Big Data hinder or benefit from each other in political science. The seven articles included in the symposium agree that while there are limits to the extent to which Big Data can help solve problems in theoretical development or causal inference, the three should not be seen as contradictory to each other. In many cases Big Data can supplement the other two.

Section on Big Data
Sociological Methodology 45(1), 2015.

This section focusing on Big Data includes two articles. The first of these develops methods for analysing large-scale administrative datasets to yield econometric measures for urban studies. The second argues for a supervised learning method for analysing unstructured text content that combines machine-based and human-centric approaches.

Big Data, Big Questions
Special section in International Journal of Communication 8, 2014.

This special section includes eight articles which discuss political, ethical, and epistemological issues pertaining to Big Data. The issues discussed in the articles include power asymmetries related to data access; meanings attached to the term ‘Big Data’ in different discourses; the implications for democratic media of the use of Big Data in market advertising; problems pertaining to simplifications and standardizations in large-scale data sets; transparency in Twitter data collection and production; the uses and limitations of spatial Big Data; understanding the practices of the Quantified Self Movement; and the relationship between theory and Big Data.

Big Data in Communication Research
Journal of Communication 64(2), 2014.

This special issue includes eight research articles that use Big Data to address various questions in communication research. The questions addressed include agenda formation in politics; organizational forms of peer production projects; temporal dynamics and content of Twitter messages during elections; the relationship between television broadcasts and online discussion and participation; the acceptance of anti-smoking advertisements; the measurement of political homophily on Twitter; and cross-cultural variation in the use of emoticons.

Symposium on Big Data
Journal of Economic Perspectives 28(2), 2014.

This symposium contains four articles, covering problems of Big Data analysis in economics and machine learning techniques suitable for addressing them; applications of data mining to the analysis of high-dimensional data; the uses of data gathered in political campaigns; and privacy issues pertaining to the use of Big Data in economics.

Policy by Numbers: How Big Data is Transforming Security, Governance, and Development
SAIS Review of International Affairs 34(1), 2014.

This issue features essays characterising the role of data in international affairs. The themes discussed range from the effects of selection bias in data collection on policymaking and the potential to use Big Data to estimate slavery, to issues pertaining to the openness of data and data custodianship.

Big Data/Ethnography or Big Data Ethnography
Session in Ethnographic Praxis in Industry Conference 2013.

The EPIC 2013 session on Big Data examines the relationship between Big Data and ethnographic research. The five articles in this session discuss the value of ‘small’ personal data in business, develop tools for analysing qualitative Big Data, argue that Big Data and ethnography should both be viewed as interpretative approaches to analysing human behaviour, examine the discourses and practices surrounding data among technology designers and the health and wellness community, and investigate the use of mobile money using mixed ethnographic methods.

Big Data in Political Science
Political Analysis virtual issue 5, 2013.

This virtual issue is a collection of articles published in Political Analysis between 2005 and 2013 that showcase the uses of Big Data and methods for analysing it in political science. The uses demonstrated by the articles include the validation of survey reports of voting, validation of online experiments, development of techniques for identifying word usage differences between groups of people, spatial sampling methods based on GPS data, and the measurement of legal significance and doctrinal development in judicial politics. The methods introduced include various Bayesian approaches to Big Data analysis and a general method for statistical inference with network data.

References

Floridi, L. (2012): Big Data and Their Epistemological Challenge. Philosophy & Technology 25(4).

Frické, M. (2015): Big Data and its epistemology. Journal of the Association for Information Science and Technology 66(4).

Hey, T., Tansley, S., and Tolle, K. (Eds.) (2009): The fourth paradigm: Data-intensive scientific discovery. Redmond, WA: Microsoft Research.

Kelling, S., Hochachka, W., Fink, D., Riedewald, M., Caruana, R., Ballard, G., and Hooker, G. (2009): Data-intensive Science: A New Paradigm for Biodiversity Studies. BioScience 59(7).

Kitchin, R. (2014): Big Data, new epistemologies and paradigm shifts. Big Data & Society 1(1).

Kitchin, R. and McArdle, G. (2016): What makes Big Data, Big Data? Exploring the ontological characteristics of 26 datasets. Big Data & Society 3(1).

Leonelli, S. (2014): What difference does quantity make? On the epistemology of Big Data in biology. Big Data & Society 1(1).