My fourth iteration of the "Programming for social scientists" class is coming to an end. Throughout the course, I've emphasized the need not only to learn programming, but also to contextualize that programming in the social sciences. Based on student feedback, there is still work to do in this area to ensure people can connect programming (and computational thinking) to their domain of expertise.
One approach I've used is to read some of the canonical texts on computational social science, including the method-driven notion (Cioffi-Revilla, 2010) and the data-driven approach (Lazer et al., 2009). In our computational social science study program, we've tried to adapt Cioffi-Revilla's classification
- complex systems
- data extraction
- network analysis
as a framework to understand and classify different approaches. However, the more I've explored it, the more challenging it has become for me. My main challenge relates to data extraction, which covers several aspects of data collection, preprocessing, and the actual processing and analysis of the data. Too many stages of the research process are bundled into a single label. (Note, however, that I do more research applying data extraction methods, so I'm much more familiar with these than with complexity science or other related areas.)
Instead, the classification should explore the role of computation throughout the research process in more detail, for example:
- data collection and preprocessing
- using computation to process data
- using computation to analyze data
When using programming to support data collection and preprocessing, the aim is to use computation only to support the data collection itself, for example through web scraping. The data can then be analyzed using other methods, such as qualitative or quantitative analysis. This stage also includes the steps needed to process the data into a format in which it can be further analyzed.
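As a concrete illustration of this stage, here is a minimal, self-contained scraping sketch using only Python's standard library. The HTML fragment and the `post` class name are hypothetical stand-ins for a real page, which would normally be fetched over the network (e.g. with `urllib.request`) before parsing.

```python
from html.parser import HTMLParser

# Hypothetical page fragment standing in for a downloaded web page.
SAMPLE_HTML = """
<html><body>
  <div class="post">First comment</div>
  <div class="post">Second comment</div>
  <div class="footer">Site footer</div>
</body></html>
"""

class PostExtractor(HTMLParser):
    """Collect the text of every <div class="post"> element."""
    def __init__(self):
        super().__init__()
        self.in_post = False
        self.posts = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag
        if tag == "div" and ("class", "post") in attrs:
            self.in_post = True

    def handle_endtag(self, tag):
        if tag == "div":
            self.in_post = False

    def handle_data(self, data):
        if self.in_post and data.strip():
            self.posts.append(data.strip())

extractor = PostExtractor()
extractor.feed(SAMPLE_HTML)
print(extractor.posts)  # scraped texts, ready e.g. for qualitative coding
```

The point is that computation ends here: the collected texts could then be coded and interpreted entirely by hand.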
Using computation to process data goes one step further, but still stops short of full computational analysis. Computation is used to process the data; that is, derived measures are computed to support the analysis at later stages. For example, many social network statistics are data processing: computing is used to produce numbers which are then used in the detailed analysis. Similarly, most supervised learning methods (computational tools used to classify content) are used by social scientists as a processing step, where the results of the computation are used to elaborate an argument or to feed into the analysis.
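A toy sketch of computation as a processing step, using an invented edge list: computing degree counts, one of the simplest network statistics. The numbers are not findings in themselves; they become material for the researcher's subsequent analysis.

```python
# Invented edge list for illustration (undirected ties between actors).
edges = [("alice", "bob"), ("alice", "carol"),
         ("bob", "carol"), ("carol", "dan")]

# Count how many ties each actor participates in (degree).
degree = {}
for a, b in edges:
    degree[a] = degree.get(a, 0) + 1
    degree[b] = degree.get(b, 0) + 1

# The computed numbers feed the later, human-led analysis.
print(degree)  # {'alice': 2, 'bob': 2, 'carol': 3, 'dan': 1}
```

In practice one would use a dedicated library, but the shape of the work is the same: computation produces a measure, and the interpretation happens elsewhere.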
Furthermore, computation can be used not only to produce predefined results, as highlighted above, but to seek out what may emerge from the data. I refer to this as analysis of the data. Many simulation models have this characteristic: based on a set of assumptions and initial conditions, the computer computes what will happen, and this can be reported as the final outcome of the research. Similarly, unsupervised learning methods, such as clustering, belong to this area.
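To illustrate the emergent character of this kind of analysis, here is a minimal one-dimensional k-means sketch in plain Python, with invented data; the clusters are not predefined but arise from the data itself.

```python
# Invented measurements that happen to form two groups.
data = [1.0, 1.2, 0.8, 8.0, 8.4, 7.9]
centers = [data[0], data[3]]  # naive initialization with two seeds

for _ in range(10):
    # Assign each point to its nearest center.
    groups = {0: [], 1: []}
    for x in data:
        nearest = min((0, 1), key=lambda c: abs(x - centers[c]))
        groups[nearest].append(x)
    # Move each center to the mean of its assigned points.
    centers = [sum(g) / len(g) for g in groups.values()]

# The two cluster centers are an outcome discovered from the data.
print(sorted(round(c, 1) for c in centers))  # [1.0, 8.1]
```

Here the output itself, the two clusters, could be reported as a research finding, which is what distinguishes this stage from processing.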
So, the aim is to highlight computation from the process perspective, asking what it can do throughout the research process. By looking at computation as part of the research process, I assume students will be better prepared to consider where their own research could benefit from computational data analysis. This can hardly be achieved by reading through existing lists of methods and trying to understand each of them.
Secondly, the approach discussed above highlights the importance of the research question. Different strategies can be applied to integrate computational data analysis for each research question, and each research question can involve several different phases. Furthermore, the same computational method can count as part of a different phase depending on the exact framing of the research question.
(Cross-posted from my personal blog)