Computational Grounded Methods: Evolution, Comparisons, and Debates
The Methodological Framework of Computational Grounded Theory and Modes of Human-Machine Collaborative Reasoning
These works lay the theoretical foundation of computational grounded theory (CGT), exploring how the inferential logics of social science (induction, deduction, abduction) can be combined with the computational capacity of machine learning. Their focus is on building iterative "human-machine collaboration" frameworks in which algorithms augment the researcher's attention, while qualitative research retains its interpretive depth and its essential aim of theory generation.
- A machine learning model of cultural change: Role of prosociality, political attitudes, and Protestant work ethic (Abhishek Sheetal, Krishna Savani, 2021, The American psychologist)
- Researcher reasoning meets computational capacity: Machine learning for social science (Ian Lundberg, J. Brand, Nanum Jeon, 2022, Social science research)
- With eyes of a machine: A three-step guide for applying machine learning to visual content analysis in social research(Anna Helene Kvist Møller, Massimo Airoldi, 2025, Big Data & Society)
- Machine Learning as Grounded Theory: Human-Centered Interfaces for Social Network Research through Artificial Intelligence(Lorenzo Barberis Canonico, Nathan J. Mcneese, Chris F. Duncan, 2018, Proceedings of the Human Factors and Ergonomics Society Annual Meeting)
- Causal Machine Learning: A Deductive–Inductive Framework for Sociological Research(Nanum Jeon, Jennie E. Brand, 2026, KZfSS Kölner Zeitschrift für Soziologie und Sozialpsychologie)
- Big Data & Inductive Theory Development: Towards Computational Grounded Theory?(N. Berente, S. Seidel, 2014)
- A novel, human-in-the-loop computational grounded theory framework for big social data(Lama Alqazlan, Zheng Fang, Michael Castelle, Rob Procter, 2025, Big Data & Society)
- Text as Data: A New Framework for Machine Learning and the Social Sciences(K. Freeman, 2023, Contemporary Sociology)
- Application of machine learning to understand child marriage in India(A. Raj, Nabamallika Dehingia, Ashutosh Kumar Singh, L. Mcdougal, Julian McAuley, 2020, SSM - Population Health)
- Using Natural Language Processing and Qualitative Analysis to Intervene in Gang Violence: A Collaboration Between Social Work Researchers and Data Scientists(D. Patton, K. McKeown, Owen Rambow, J. Macbeth, 2016, ArXiv)
Empirical Applications of Computational Grounded Methods to Complex Social Issues and Discourse Analysis
This group of studies demonstrates the empirical value of CGT across diverse social settings. The research covers gender and racial discrimination, populist logics, minority stress in LGBTQ+ communities, COVID-19 risk perception, radicalization processes, and public health policy discourse. By combining automated tools (topic models, classifiers) with qualitative interpretation, researchers extract deep social dynamics and cultural meaning from large-scale social media and news data.
- The Language of LGBTQ+ Minority Stress Experiences on Social Media(Koustuv Saha, Sang Chan Kim, Manikanta D. Reddy, A. J. Carter, Eva Sharma, Oliver L. Haimson, M. Choudhury, 2019, Proceedings of the ACM on Human-Computer Interaction)
- The stories about racism and health: the development of a framework for racism narratives in medical literature using a computational grounded theory approach(Caroline Figueroa, Erin Manalo-Pedro, Swetha Pola, Sajia Darwish, Pratik S. Sachdeva, Christian Guerrero, Claudia Von Vacano, Maithili Jha, Fernando De Maio, Chris J Kennedy, 2023, International Journal for Equity in Health)
- A computational grounded theory based analysis of research on China’s old-age social welfare system(Yingying Li, N. Mi, Xinyue Pan, Chao Ma, Zhenyu Sun, Geraldo Timoteo, 2025, Frontiers in Public Health)
- Cyborg Imaginaries: A Computational Grounded Theory of Online Pioneer Community Discussions on Human Augmentation(Giulia Frascaria, Daniela Jaramillo-Dent, Michael Latzer, 2026, AoIR Selected Papers of Internet Research)
- The discursive logics of online populism: social media as a “pressure valve” of public debate in China(Kun He, Scott A. Eldridge, M. Broersma, 2023, Journal of Information Technology & Politics)
- Investigating Perception of Gender Stereotypes in Large Language Models: A Computational Grounded Theory Approach(R. Salvi, Nigel Bosch, 2025, ACM Journal on Responsible Computing)
- Exploring Conceptualizations of COVID-19 Risk in Ideologically Distinct Online Communities: A Computational Grounded Theory Analysis(Tiwaladeoluwa B. Adekunle, Jeremy Foote, Toluwani E. Adekunle, Nathan TeBlunthuis, Laura K. Nelson, 2025, Journal of Medical Internet Research)
- The plurality and shifting of framing genetical modification risks on Chinese social media(Xiaoxiao Cheng, 2024, Health, Risk & Society)
- To the Extreme: Exploring the Rise of a Deviant Culture in a Misogynist Digital Community(Yongren Shi, Kevin Kiley, Stephanie M. DiPietro, 2024, Socius)
- Artificial Intelligence in Brazilian News: A Mixed-Methods Analysis(Raphael Hernandes, Giulio Corsi, 2024, ArXiv)
- From Strange to Normal: Computational Approaches to Examining Immigrant Incorporation Through Shifts in the Mainstream(Andrea Voyer, Zachary D. Kline, Madison Danton, T.G. Volkova, 2022, Sociological Methods & Research)
- Mapping the discursive dimensions of the reproducibility crisis: A mixed methods analysis(Nicole C. Nelson, Kelsey Ichikawa, Julie Chung, M. Malik, 2020, PLoS ONE)
- Unsupervised Machine Learning to Detect and Characterize Barriers to Pre-exposure Prophylaxis Therapy: Multiplatform Social Media Study(Qing Xu, Matthew C. Nali, Tiana J McMann, Hector Godinez, Jiawei Li, Yifan He, Mingxiang Cai, Christine Lee, Christine Merenda, Richardae Araojo, T. Mackey, 2021, JMIR Infodemiology)
- Everyday life information experiences in Twitter: a grounded theory(Faye Q. Miller, Kate Davis, Helen Partridge, 2019, Inf. Res.)
The Evolution of Technical Tools: Computer-Assisted Coding, Interpretable AI, and NLP Pipelines
This strand of the literature focuses on technical innovation aimed at the efficiency and transparency problems of large-scale text processing. The studies cover unsupervised learning (clustering), tensor decomposition, NLP pipelines for specific languages, and LLM-assisted coding tools, seeking to improve the clarity of topic discovery, the accuracy of automated coding, and the interpretability of the computational process.
- Social Science for Natural Language Processing: A Hostile Narrative Analysis Prototype(S. Anning, G. Konstantinidis, Craig Webber, 2021, Proceedings of the 13th ACM Web Science Conference 2021)
- From Data to Discovery: Unsupervised Machine Learning's Role in Social Cognition(Jonathan E. Doriscar, Michalis Mamakos, Sylvia P. Perry, Tessa E. S. Charlesworth, 2025, Social Cognition)
- Identification of social scientifically relevant topics in an interview repository: a natural language processing experiment(Judit Gárdos, Julia Egyed-Gergely, Anna Horváth, Balázs E. Pataki, R. Vajda, András Micsik, 2023, J. Documentation)
- Using Supervised Machine Learning to Code Policy Issues(Bjorn Burscher, R. Vliegenthart, Claes H. de Vreese, 2015, The ANNALS of the American Academy of Political and Social Science)
- Unmasking Machine Learning With Tensor Decomposition: An Illustrative Example for Media and Communication Researchers(Yu Won Oh, Chong Hyun Park, 2025, Media and Communication)
- DelibAnalysis: Understanding the quality of online political discourse with machine learning(Eleonore Fournier-Tombs, G. Di Marzo Serugendo, 2019, Journal of Information Science)
- Using Machine Learning to Support Qualitative Coding in Social Science(N. Chen, Margaret Drouhard, Rafal Kocielnik, J. Suh, Cecilia R. Aragon, 2018, ACM Transactions on Interactive Intelligent Systems (TiiS))
Epistemological Debates, Reflections on Fairness, and the Evaluation of Computational Rigor
This group of studies reflects critical thinking about the application of computational methods in the social sciences. Discussion centers on gender and racial biases latent in algorithms, conflicting definitions of machine-learning fairness, and how new standards of rigor can be established through intersectionality theory and human feedback. The studies also ask whether computational simulation conforms to the scientific method and examine the limits of cluster analysis for sociological theoretical interpretation.
- Investigating gender and racial-ethnic biases in sentiment analysis of language(Steven Zhou, Arushi Srivastava, 2024, Cogent Psychology)
- Sociolinguistic auto-coding has fairness problems too: measuring and mitigating bias(Dan Villarreal, 2024, Linguistics Vanguard)
- A Computational Method for Measuring "Open Codes" in Qualitative Analysis(John Chen, Alexandros Lotsos, Lexie Zhao, Jessica Hullman, Bruce Sherin, Uri Wilensky, Michael Horn, 2024, ArXiv)
- From Narratives to Code: Using Intersectional Methodology (IM) for Algorithmic Designs(Princess Chihurumnaya Samuel, 2025, Proceedings of the ACM Global Computing Education Conference 2025 - Volume 2)
- Studying Culture and Meaning Through Interpretative Computational Methods: From theory to method and back(Jan Goldenstein, Dennis Jancsary, Stine Grodal, Bernard Forgues, P. Devereaux, Dev Jennings, 2026, Organization Studies)
- The Scientific Method in the Science of Machine Learning(Jessica Zosa Forde, Michela Paganini, 2019, ArXiv Preprint)
- Sociology of values: experience of building a taxonomy by using natural language analysis technology(M. Kashina, S. Tkach, 2023, Digital Sociology)
- Sentiment and Language: A Socio-Semiotic Analysis(Junfeng Zhang, 2024, Philosophy Journal)
- Facebook and social representations of Filipino migrant life in Germany: a reflexive computational approach(Audris Umel, 2024, Frontiers in Human Dynamics)
Interdisciplinary Extensions: Measuring Psychological Constructs, Social Simulation, and Governance Practice
This group of works traces the extension of CGT ideas into broader fields. On one side, deep learning and LLMs are used to quantitatively measure complex psychological constructs (such as humility and moral values) and to simulate behavior; on the other, computational methods are applied to urban digital governance, socio-technical risk analysis, and computer-vision reasoning, demonstrating the broad promise of computational grounded methods as interdisciplinary research tools.
- Developing a Text‐Based Measure of Humility in Inquiry Using Computational Grounded Theory(Sarah Bratt, E. Leahey, Charlie Gomez, Jina Lee, Yeaeun Kwon, C. Lassiter, 2024, Proceedings of the Association for Information Science and Technology)
- MoralBERT: A Fine-Tuned Language Model for Capturing Moral Values in Social Discussions(Vjosa Preniqi, Iacopo Ghinassi, Julia Ive, C. Saitis, K. Kalimeri, 2024, Proceedings of the 2024 International Conference on Information Technology for Social Good)
- A Fuzzy-SNA Computational Framework for Quantifying Intimate Relationship Stability and Social Network Threats(Ning Wang, Xiangzhi Kong, 2026, Symmetry)
- Multi-Stage Simulation of Residents' Disaster Risk Perception and Decision-Making Behavior: An Exploratory Study on Large Language Model-Driven Social-Cognitive Agent Framework(Xinjie Zhao, Hao Wang, Chengxiao Dai, Jiacheng Tang, Kaixin Deng, Zhihua Zhong, Fanying Kong, Shiyun Wang, So Morikawa, 2025, Syst.)
- The Role of Influential Actors in Fostering the Polarized COVID-19 Vaccine Discourse on Twitter: Mixed Methods of Machine Learning and Inductive Coding(Loni Hagen, Ashley Fox, Heather O’Leary, DeAndre Dyson, Kimberly Walker, C. Lengacher, R. Hernandez, 2021, JMIR Infodemiology)
- Revealing the Role of Intra-household Dynamics in Computer Adoption: An Inductive Theorization Approach Using Machine Learning in the Indian Context(Sharada Sringeswara, Jang Bahadur Singh, S. Sharma, S. Gouda, 2025, Information Systems Frontiers)
- Engaging Comunalidad as Theory and Praxis in Language Reclamation(María Cecilia Schwedhelm, 2025, International Journal of Literacy, Culture, and Language Education)
- Text Mining for Social Good; Context-aware Measurement of Social Impact and Effects Using Natural Language Processing(R. Rezapour, 2020, Companion Publication of the 2020 Conference on Computer Supported Cooperative Work and Social Computing)
- ML-Schema: Exposing the Semantics of Machine Learning with Schemas and Ontologies(Gustavo Correa Publio, Diego Esteves, Agnieszka Ławrynowicz, Panče Panov, Larisa Soldatova, Tommaso Soru, Joaquin Vanschoren, Hamid Zafar, 2018, ArXiv Preprint)
- Data-theoretic methodology and computational platform to quantify organizational factors in socio-technical risk analysis(J. Pence, T. Sakurahara, Xuefeng Zhu, Z. Mohaghegh, M. Ertem, Cheri Ostroff, E. Kee, 2019, Reliab. Eng. Syst. Saf.)
- Research on the factors influencing government governance effectiveness from a digital governance perspective (李凌丰, 窦文章, 2024, 现代管理 [Modern Management])
- Grounded Reinforcement Learning for Visual Reasoning(Gabriel Sarch, Snigdha Saha, Naitik Khandelwal, Ayush Jain, Michael J. Tarr, Aviral Kumar, Katerina Fragkiadaki, 2025, ArXiv Preprint)
This report synthesizes the full arc of computational grounded theory (CGT), from the construction of its theoretical framework to its interdisciplinary applications. The research has not only established a methodological logic centered on human-machine collaboration and abductive reasoning, but has also achieved deep empirical grounding in fields such as social discourse analysis and the measurement of psychological constructs. At the same time, sustained scholarly debate over the bias, fairness, and epistemological rigor of computational tools has pushed the field from simple automated coding toward a more reflexive, interpretive, and intersectional paradigm of computational social science.
A total of 52 related references.
The availability of big data has significantly influenced the possibilities and methodological choices for conducting large-scale behavioural and social science research. In the context of qualitative data analysis, a major challenge is that conventional methods require intensive manual labour and are often impractical to apply to large datasets. One effective way to address this issue is by integrating emerging computational methods to overcome scalability limitations. However, a critical concern for researchers is the trustworthiness of results when machine learning and natural language processing tools are used to analyse such data. We argue that confidence in the credibility and robustness of results depends on adopting a ‘human-in-the-loop’ methodology that is able to provide researchers with control over the analytical process, while retaining the benefits of using machine learning and natural language processing. With this in mind, we propose a novel methodological framework for computational grounded theory that supports the analysis of large qualitative datasets, while maintaining the rigour of established grounded theory methodologies. To illustrate the framework’s value, we present the results of testing it on a dataset collected from Reddit in a study aimed at understanding tutors’ experiences in the gig economy.
Purpose By the end of 2024, 22% of the Chinese population was aged 60 and above, making old-age social welfare a critical challenge. Despite abundant literature, a gap remains between research and policy. This study applies Nelson’s computational grounded theory to systematically analyze China’s old-age social welfare research and propose targeted policy priorities. Methods We searched Chinese literature (2014–2024) from the Wanfang, CNKI, and CQVIP databases. After preprocessing the abstracts, we applied topic modeling using latent Dirichlet allocation, guided by human analysts. Optimal topics were determined using perplexity and coherence metrics. Researchers then linked each topic to sociologically meaningful concepts to derive abstract policy conclusions. Results A total of 413 articles met eligibility criteria. Seven topics emerged: (1) the theoretical significance of social welfare policy; (2) enhancing rural old-age care; (3) providing care for special groups; (4) promoting a home-community care model; (5) optimizing precision care through collaborative mechanisms; (6) developing community culture; and (7) establishing supply-driven care services. Notably, topics two and seven dominated the literature. Conclusion Based on these themes, we propose policy priorities to enhance comprehensive social welfare programs. China’s big government model—a top-level design involving diverse stakeholders—may serve as an effective framework for addressing a global aging society marked by rising non-communicable diseases and AI-driven economic growth. Moreover, our computer-assisted approach offers a valuable method for information scientists, aiding policymakers in navigating extensive digital data for more cost-effective and timely decision-making.
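The topic-count selection step described above, fitting LDA across candidate numbers of topics and comparing models, can be sketched as follows. This is a minimal illustration using scikit-learn on a handful of invented abstract fragments, not the paper's 413 abstracts; the study also used coherence metrics, which are omitted here (they would typically come from a separate library such as gensim).

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Invented abstract fragments standing in for the preprocessed corpus.
docs = [
    "rural old age care pension services reform",
    "community home care model for elderly residents",
    "social welfare policy theory and significance",
    "care services supply for special groups",
    "community culture and elderly wellbeing",
    "precision care through collaborative mechanisms",
]

vec = CountVectorizer()
X = vec.fit_transform(docs)

def pick_topic_count(X, candidates):
    """Fit one LDA model per candidate topic count; return the count with
    the lowest perplexity (lower is better) along with all scores."""
    scores = {}
    for k in candidates:
        lda = LatentDirichletAllocation(n_components=k, random_state=0).fit(X)
        scores[k] = lda.perplexity(X)
    return min(scores, key=scores.get), scores

best_k, scores = pick_topic_count(X, [2, 3, 4])
print(best_k, {k: round(v, 1) for k, v in scores.items()})
```

On real data, perplexity would be computed on held-out documents and cross-checked against a coherence score before fixing the topic number.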
Artificial Intelligence has expanded its influence far beyond traditional boundaries in our society. One prominent application of artificial intelligence is the use of large language models, which have transcended their initial roles in high-tech industries and academic research and are now actively utilized by individual users. These models have continually improved over the years in their generative capabilities and performance across numerous tasks. However, they still pose a persistent risk of reproducing biases and stereotypes. Previous research has predominantly focused on quantitatively measuring biases in these large language models. In this study, we seek to assess not just the presence of bias itself, but the perception of stereotypes by these models via in-depth exploration of their responses. We demonstrate how the computational grounded theory framework, which integrates qualitative and quantitative approaches, can be applied in this context to assess the conceptualization of stereotypes. Furthermore, we contrast language model results with a survey of 400 human participants who completed prompts similar to those given to the model, in order to understand people’s perception of gender stereotypes. The results indicate substantial similarities between language model and human perceptions of stereotypes, highlighting that a model’s perception stems from societal perception of stereotypes.
Background The COVID-19 pandemic has had a profound impact on societies and economies around the globe, and experts warn about the potential for similar crises in the future. Risk communication theories underscore that while the potential for harm is objective, risk perception is a subjective, socially derived interpretation. While there is broad literature on the social construction of risk, fewer studies examine the role of communities—online or offline—in developing and reinforcing distinct interpretations of the same risk event. During COVID-19, online communities emerged as individuals sought to make sense of the ongoing crisis. These communities offer an opportunity to gain important insights into how the concerned public collectively interprets risk and creates group identities, informing public health strategies. Objective This study aims to, first, explore how online communities with distinct ideologies create and reinforce divergent conceptualizations of risk and, second, identify the role of group identity in shaping the development and communication of risk interpretations in these communities. Methods We used computational grounded theory, a multistep approach that includes pattern detection, hypothesis testing, and pattern confirmation to explore interpretations of risk and group identity in about 500,000 comments from the subreddits r/LockdownSkepticism and r/Masks4All. In the pattern detection step of this study, we grouped comments by the post they were made on and then used latent Dirichlet allocation topic modeling to identify 10 topics based on the frequency of term co-occurrence. In the hypothesis refinement step, we conducted a qualitative thematic analysis of 30 posts under each topic using Braun and Clarke’s approach. Finally, in the pattern confirmation step, we trained a Word2Vec word embedding model to validate emerging themes from the second step.
Results This study found that Masks4All and LockdownSkepticism both centered risk in their conversations, but with divergent concerns related to the threat of COVID-19. While Masks4All emphasized the threat to health, LockdownSkepticism questioned the necessity of preventive measures and focused on other risks: the threat to the economy, educational disruptions, and social isolation. Group identity was also found to shape collective meanings around risk, as community members in both subreddits affirmed group positions and condemned the outgroup. Conclusions This study demonstrated that while both communities were concerned about COVID-19, their perceptions of risk focused on different aspects of the same risk event. This underscores the need for targeted interventions that engage with divergent ideologies and value systems across groups of people.
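The pattern-confirmation step above checks that qualitatively derived themes are reflected in word-embedding geometry: keywords belonging to one theme should sit closer to each other than to keywords of another. The study trains a Word2Vec model; the sketch below substitutes a lightweight count-based embedding (positive pointwise mutual information followed by SVD) so that it runs with NumPy alone, and the six "comments" are invented stand-ins for the roughly 500,000 Reddit comments.

```python
import numpy as np
from itertools import product

# Toy stand-in for the tokenized subreddit comments.
sentences = [
    "masks protect health hospitals overwhelmed".split(),
    "masks health risk hospitals".split(),
    "masks protect health".split(),
    "lockdown hurts economy jobs".split(),
    "economy jobs lost lockdown".split(),
    "lockdown economy schools closed".split(),
]

vocab = sorted({w for s in sentences for w in s})
idx = {w: i for i, w in enumerate(vocab)}

# Within-sentence co-occurrence counts (symmetric, no self-counts).
C = np.zeros((len(vocab), len(vocab)))
for s in sentences:
    for a, b in product(s, s):
        if a != b:
            C[idx[a], idx[b]] += 1

# Positive pointwise mutual information, then SVD for dense word vectors.
total = C.sum()
expected = np.outer(C.sum(axis=1), C.sum(axis=0))
pmi = np.log(np.where(C > 0, C * total / expected, 1.0))
ppmi = np.maximum(pmi, 0.0)
U, S, _ = np.linalg.svd(ppmi)
vecs = U[:, :5] * S[:5]

def sim(a, b):
    """Cosine similarity between two word vectors."""
    va, vb = vecs[idx[a]], vecs[idx[b]]
    return float(va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb) + 1e-9))

# Keywords from the same qualitative theme should score higher
# than keywords drawn from different themes.
print(sim("masks", "health"), sim("masks", "economy"))
```

With a real Word2Vec model, the same check is a one-liner (`model.wv.similarity`), applied to the theme keywords surfaced during the qualitative step.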
Introduction The scientific study of racism as a root cause of health inequities has been hampered by the policies and practices of medical journals. Monitoring the discourse around racism and health inequities (i.e., racism narratives) in scientific publications is a critical aspect of understanding, confronting, and ultimately dismantling racism in medicine. A conceptual framework and multi-level construct is needed to evaluate the changes in the prevalence and composition of racism over time and across journals. Objective To develop a framework for classifying racism narratives in scientific medical journals. Methods We constructed an initial set of racism narratives based on an exploratory literature search. Using a computational grounded theory approach, we analyzed a targeted sample of 31 articles in four top medical journals which mentioned the word ‘racism’. We compiled and evaluated 80 excerpts of text that illustrate racism narratives. Two coders grouped and ordered the excerpts, iteratively revising and refining racism narratives. Results We developed a qualitative framework of racism narratives, ordered on an anti-racism spectrum from impeding anti-racism to strong anti-racism, consisting of 4 broad categories and 12 granular modalities for classifying racism narratives. The broad narratives were “dismissal,” “person-level,” “societal,” and “actionable.” Granular modalities further specified how race-related health differences were related to racism (e.g., natural, aberrant, or structurally modifiable). We curated a “reference set” of example sentences to empirically ground each label. Conclusion We demonstrated racism narratives of dismissal, person-level, societal, and actionable explanations within influential medical articles. Our framework can help clinicians, researchers, and educators gain insight into which narratives have been used to describe the causes of racial and ethnic health inequities, and to evaluate medical literature more critically. 
This work is a first step towards monitoring racism narratives over time, which can more clearly expose the limits of how the medical community has come to understand the root causes of health inequities. This is a fundamental aspect of medicine’s long-term trajectory towards racial justice and health equity.
We describe a project in which we develop a text‐based measure of humility in inquiry (HI) in the context of scholarly communication using corpora of scientific publications. The data and analytic approach we use will circumvent known concerns with self‐reported data on humility levels and will be calculable on a large scale. We use a computational grounded theory approach to develop a text‐based measure of HI. We draw from an annotated corpus of scientific articles in economics, psychology, and sociology (2010–2023), generating three supra‐dimensions of HI (Epistemic, Rhetorical, and Transparent) and several novel sub‐codes of HI. We present our initial analysis with a focus on the three dimensions of HI derived from a computational grounded theory approach. The text‐based measure helps us better understand how contextual factors shape HI and contribute to mixed methods in information science research.
Human Augmentation (HA) technologies, such as Brain-Computer Interfaces (BCIs), neurostimulation devices, and microchip implants, are increasingly discussed in online pioneer communities, where early adopters shape imaginaries of technologically mediated human futures. As part of the broader process of digitalization, HA technologies contribute to the platformization of the human body. While these technologies remain experimental, transhumanists and biohackers engage with them as tools for self-enhancement, body modification, and posthuman evolution. These imaginaries are critical to understanding future adoption, yet remain underexplored in scholarly literature. This study applies computational grounded theory (CGT) to analyze discussions on Reddit, identifying emerging sociotechnical imaginaries of HA technologies. Using BERTopic, a transformer-based topic modeling approach, we extract thematic structures from a dataset of 1,503 posts and 60,327 comments spanning 2008–2025. The imaginaries are then defined through qualitative analysis and iterative refinement of the model, ensuring deeper contextual grounding. Preliminary findings reveal three key dimensions of cyborg imaginaries: (1) Beliefs, including aspirations such as immortality and concerns over job automation; (2) Practices, particularly cognitive and sensory augmentation; and (3) Technological Advances, with discussions centered on BCIs, neural implants, and cybernetic enhancements. This extended abstract presents initial results, contributing to broader discussions on digitalization, human-technology integration, and cyborgization.
Internet technologies have created unprecedented opportunities for people to come together and through their collective effort generate large amounts of data about human behavior. With the increased popularity of grounded theory, many researchers have sought to use ever-larger datasets to analyze and draw patterns about social dynamics. However, the data is simply too big to enable a single human to derive effective models for many complex social phenomena. Computational methods offer a unique opportunity to analyze a wide spectrum of sociological events by leveraging the power of artificial intelligence. Within the human factors community, machine learning has emerged as the dominant AI-approach to deal with big data. However, along with its many benefits, machine learning has introduced a unique challenge: interpretability. The models of macro-social behavior generated by AI are so complex that rarely can they be translated into human understanding. We propose a new method to conduct grounded theory research by leveraging the power of machine learning to analyze complex social phenomena through social network analysis while retaining interpretability as a core feature.
Computational approaches have grown in prominence amidst advancements in new media and technologies and ever-increasing amounts of digital data. This article critically examines these automated techniques, especially the analytical affordances and concerns that such methods introduce to the study of online migrant and mobility discourses. The paper further argues for a mixed methodology anchored on social representations theory—a contextually sensitive framework that enables reflexive use of computational approaches, i.e., to quantitatively analyze while also exploring different layers of cultural and linguistic meanings in online diasporic interactions. With Filipino migrants in Germany as a case study and partner community, the study then demonstrates the combined application of topic modeling and ethnographically inspired qualitative analysis on migrant posts in Facebook. The findings are discussed in the form of a cultural reflection on Filipino values and expectations and an advocacy for mixed methodologies grounded on critical, social, and practice-oriented theories.
This article presents a computational approach to examining immigrant incorporation through shifts in the social “mainstream.” Analyzing a historical corpus of American etiquette books, texts from 1922–2017 describing social norms, we identify mainstream shifts related to long-standing groups which once were and may currently still be seen as immigrant outsiders in the United States: Catholic, Chinese, Irish, Italian, Jewish, Mexican, and Muslim groups. The analysis takes a computational grounded theory approach, combining qualitative readings and computational text analyses. Using word embeddings, we operationalize the chosen groups as focal group concepts. We extract sections of text that are salient to the focal group concepts to create group-specific text corpora. Two computational approaches make it possible to examine mainstream shifts in these corpora. First, we use sentiment analysis to observe the positive sentiment in each corpus and its change over time. Second, we observe changes in each corpus's position on a semantic dimension represented by the poles of “strange” and “normal.” The results indicate mainstream shifts through increases in positive sentiment and movement from strange to normal over time for most of the group-specific corpora. These research techniques can be adapted to other studies of social sentiment and symbolic inclusion.
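The "strange"-to-"normal" semantic dimension used above can be illustrated with a toy projection: define the axis as the vector difference between the pole words and score a corpus by projecting its mean word vector onto that axis. The two-dimensional embedding below is hand-made purely for illustration; the study derives its poles from word embeddings trained on the etiquette-book corpus.

```python
import numpy as np

# Hand-made toy embedding (invented vectors, not trained).
emb = {
    "strange":  np.array([ 1.0, 0.0]),
    "normal":   np.array([-1.0, 0.0]),
    "foreign":  np.array([ 0.8, 0.3]),
    "familiar": np.array([-0.9, 0.2]),
    "neighbor": np.array([-0.7, 0.1]),
}

# The semantic axis runs from the "strange" pole to the "normal" pole.
axis = emb["normal"] - emb["strange"]
axis = axis / np.linalg.norm(axis)

def score(tokens):
    """Mean word vector projected onto the strange-to-normal axis
    (higher values are closer to the 'normal' pole)."""
    vecs = [emb[t] for t in tokens if t in emb]
    return float(np.mean(vecs, axis=0) @ axis)

print(score(["foreign"]), score(["familiar", "neighbor"]))
```

Tracking this projection for each group-specific corpus over time is what reveals the "mainstream shift" the authors describe.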
Recent instances of lethal mass violence have been linked to digital communities dedicated to misogynist and sexist ideologies. These forums often begin with discussions of more conventional or mainstream ideas, raising the question about the process through which these communities transform from relatively benign to extremist. This article presents a study of the Reddit incel community, active from mid-2016 to its ban in late 2017, which evolved from a self-help forum to a hub for extremist ideologies. We use computational grounded theory to deduce empirical patterns in forum composition, psychological states reflected in language use, and semantic content before refining and testing an interactional process that explains this change: a shift away from drawing on real-world experiences in discussion toward a greater reliance on cognitively simple symbols of group membership. This shift, in turn, leads to more discussions centered on deviant ideology. The results confirm that understanding the dynamics of conversation—specifically, how ideas are interpreted, reinforced, and amplified in recurrent, person-to-person interactions—is crucial for understanding cultural change in digital communities. Implications for sociology of groups, culture, and interactions in digital spaces are discussed.
Anchored in framing theory and the public arenas model, this study investigates the representation and temporal evolution of genetic modification (GM) risk frames on Chinese social media. Through analysis of public discussions on GM risks from 2010 to 2020, and utilising an integration of unsupervised machine learning and computational grounded theory methodologies, this study develops a categorisation schema of 13 GM risk frames. These frames span the full lifecycle of the social construction of risk, from identification and definition through assessment, social negotiation, attribution, impact evaluation, to management and mitigation. The findings reveal that GM risk discourses are multifaceted, with systematic differences in frame adoption among social actors including government agencies, experts, media outlets, and the general public. The study demonstrates that GM risk frame evolution aligns closely with public attention cycles, exhibiting three distinct patterns: fluctuating decline, punctuated equilibrium, and fluctuating increase. Additionally, it is found that key events or crises catalyse both quantitative changes in frame prominence and qualitative transformations in how GM risks are framed.
The current surge in Artificial Intelligence (AI) interest, reflected in heightened media coverage since 2009, has sparked significant debate on AI's implications for privacy, social justice, workers' rights, and democracy. The media plays a crucial role in shaping public perception and acceptance of AI technologies. However, research into how AI appears in media has primarily focused on anglophone contexts, leaving a gap in understanding how AI is represented globally. This study addresses this gap by analyzing 3,560 news articles from Brazilian media published between July 1, 2023, and February 29, 2024, from 13 popular online news outlets. Using Computational Grounded Theory (CGT), the study applies Latent Dirichlet Allocation (LDA), BERTopic, and Named-Entity Recognition to investigate the main topics in AI coverage and the entities represented. The findings reveal that Brazilian news coverage of AI is dominated by topics related to applications in the workplace and product launches, with limited space for societal concerns, which mostly focus on deepfakes and electoral integrity. The analysis also highlights a significant presence of industry-related entities, indicating a strong influence of corporate agendas in the country's news. This study underscores the need for a more critical and nuanced discussion of AI's societal impacts in Brazilian media.
LGBTQ+ (lesbian, gay, bisexual, transgender, queer) individuals are at significantly higher risk for mental health challenges than the general population. Social media and online communities provide avenues for LGBTQ+ individuals to have safe, candid, semi-anonymous discussions about their struggles and experiences. We study minority stress through the language of disclosures and self-experiences on the r/lgbt Reddit community. Drawing on Meyer's minority stress theory, and adopting a combined qualitative and computational approach, we make three primary contributions, 1) a theoretically grounded codebook to identify minority stressors across three types of minority stress-prejudice events, perceived stigma, and internalized LGBTphobia, 2) a machine learning classifier to scalably identify social media posts describing minority stress experiences, that achieves an AUC of 0.80, and 3) a lexicon of linguistic markers, along with their contextualization in the minority stress theory. Our results bear implications to influence public health policy and contribute to improving knowledge relating to the mental health disparities of LGBTQ+ populations. We also discuss the potential of our approach to enable designing online tools sensitive to the needs of LGBTQ+ individuals.
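The classifier contribution above can be pictured as a standard supervised text pipeline evaluated by ROC AUC. The sketch below uses a TF-IDF plus logistic-regression baseline on a dozen invented posts; the paper does not specify this exact model, and its reported AUC of 0.80 refers to the authors' Reddit data, not this toy set.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Invented posts; label 1 marks a described minority-stress experience.
posts = [
    "faced slurs at work today", "family rejected me after coming out",
    "afraid to hold hands in public", "internalized shame about who i am",
    "coworkers made hostile jokes", "worried people will find out",
    "great concert last night", "new recipe turned out well",
    "looking for good hiking trails", "my cat learned a trick",
    "finished reading a good book", "weekend trip photos",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

X_train, X_test, y_train, y_test = train_test_split(
    posts, labels, test_size=4, stratify=labels, random_state=0)

# TF-IDF features feeding a logistic-regression classifier.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(X_train, y_train)

# Evaluate by area under the ROC curve, the metric reported in the paper.
auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
print(round(auc, 2))
```

The point of such a classifier in the CGT workflow is scalability: once validated against the theoretically grounded codebook, it extends the hand-coded labels to the full corpus.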
This paper explores online bottom-up populism in China by examining the discursive logics of populism that emerge within expressions of populist discontent. Through a conceptualization of the affordances of social media that considers what they enable alongside what they constrain, it uses a computational grounded theory approach to examine individuals’ posts and the use of hashtags in online communication on Sina Weibo around the #DrivingIntoThePalaceMuseum case. Through its analysis, three discursive logics of online populism are identified: antagonistic logic, polarization logic and protest logic. However, while the affordances of social media enable populist discourse polarization, they also enable “depolarization” through the government’s censorship mechanisms. This results in a dynamic bottom-up populism articulation that reflects an awareness of China’s censorship mechanisms. Within the Chinese media environment, this functions as a “pressure valve” releasing the buildup of populist sentiment in a Chinese “social volcano.”
To those involved in discussions about rigor, reproducibility, and replication in science, conversations about the “reproducibility crisis” appear ill-structured. Seemingly very different issues concerning the purity of reagents, accessibility of computational code, or misaligned incentives in academic research writ large are all collected under this label. Prior work has attempted to address this problem by creating analytical definitions of reproducibility. We take a novel empirical, mixed-methods approach to understanding variation in reproducibility discussions, using a combination of grounded theory and correspondence analysis to examine how a variety of authors narrate the story of the reproducibility crisis. Contrary to expectations, this analysis demonstrates that there is a clear thematic core to reproducibility discussions, centered on the incentive structure of science, the transparency of methods and data, and the need to reform academic publishing. However, we also identify three clusters of discussion that are distinct from the main body of articles: one focused on reagents, another on statistical methods, and a final cluster focused on the heterogeneity of the natural world. Although there are discursive differences between scientific and popular articles, we find no strong differences in how scientists and journalists write about the reproducibility crisis. Our findings demonstrate the value of using qualitative methods to identify the bounds and features of reproducibility discourse, and identify distinct vocabularies and constituencies that reformers should engage with to promote change.
Background Since COVID-19 vaccines became broadly available to the adult population, sharp divergences in uptake have emerged along partisan lines. Researchers have indicated a polarized social media presence contributing to the spread of mis- or disinformation as being responsible for these growing partisan gaps in uptake. Objective The major aim of this study was to investigate the role of influential actors in the context of the community structures and discourse related to COVID-19 vaccine conversations on Twitter that emerged prior to the vaccine rollout to the general population, and to discuss implications for vaccine promotion and policy. Methods We collected tweets on COVID-19 between July 1, 2020, and July 31, 2020, a time when attitudes toward the vaccines were forming but before the vaccines were widely available to the public. Using network analysis, we identified different naturally emerging Twitter communities based on their internal information sharing. A PageRank algorithm was used to quantitatively measure the “influentialness” of Twitter accounts and identify the “influencers,” who were then coded into different actor categories. Inductive coding was conducted to describe discourses shared in each of the 7 communities. Results Twitter vaccine conversations were highly polarized, with different actors occupying separate “clusters.” The antivaccine cluster was the most densely connected group. Among the 100 most influential actors, medical experts were outnumbered both by partisan actors and by activist vaccine skeptics or conspiracy theorists. Scientists and medical actors were largely absent from the conservative network, and antivaccine sentiment was especially salient among actors on the political right. Conversations related to COVID-19 vaccines were highly polarized along partisan lines, with “trust” in vaccines being manipulated to the political advantage of partisan actors.
Conclusions These findings are informative for designing improved vaccine communication strategies to be delivered on social media, especially by incorporating influential actors. Although polarization and echo chamber effects are not new in political conversations on social media, it was concerning to observe them in health conversations about COVID-19 vaccines during the vaccine development process.
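The PageRank step the study describes can be sketched as plain power iteration in a few lines of pure Python. This is a generic illustration over a toy directed graph of invented account names, not the study's Twitter data or tooling:

```python
def pagerank(edges, d=0.85, iters=100):
    """Power-iteration PageRank over a directed graph of (source, target) edges,
    e.g. retweet or mention links. Returns a score per node summing to 1."""
    nodes = sorted({n for e in edges for n in e})
    out = {n: [] for n in nodes}
    for s, t in edges:
        out[s].append(t)
    N = len(nodes)
    pr = {n: 1.0 / N for n in nodes}
    for _ in range(iters):
        nxt = {n: (1 - d) / N for n in nodes}  # teleportation mass
        for n in nodes:
            if out[n]:
                share = d * pr[n] / len(out[n])
                for t in out[n]:
                    nxt[t] += share
            else:
                # Dangling node: spread its mass evenly so the total stays 1.
                for t in nodes:
                    nxt[t] += d * pr[n] / N
        pr = nxt
    return pr
```

In the study's setting, the highest-scoring nodes would then be hand-coded into actor categories (medical expert, partisan actor, and so on).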
No abstract available
No abstract available
This paper presents a practical guide to machine learning–assisted visual content analysis for social scientists. Combining machine automation with human expertise and reflexivity, the proposed methodological framework bridges the gap between computer vision and social research. Our custom approach combines inductive, deductive, and abductive logics of scientific inquiry and consists of three complementary steps: (a) Pattern exploration—employing unsupervised learning to explore visual patterns within image datasets; (b) Theory-driven image classification—utilizing supervised learning with convolutional neural networks to systematically label visual content; and (c) Context-sensitive interpretation—to provide critical and creative engagement with the patterns identified in the previous steps. We illustrate these three steps, and their various combinations, through empirical examples from a study of visuality in digital diplomacy, and critically discuss the epistemological implications of using machine learning as a method in visual social research.
As online communication data continues to grow, manual content analysis, which is frequently employed in media studies within the social sciences, faces challenges in terms of scalability, efficiency, and coding scope. Automated machine learning can address these issues, but it often functions as a black box, offering little insight into the features driving its predictions. This lack of interpretability limits its application in advancing social science communication research and fostering practical outcomes. Here, explainable AI offers a solution that balances high prediction accuracy with interpretability. However, its adoption in social science communication studies remains limited. This study illustrates tensor decomposition—specifically, PARAFAC2—for media scholars as an interpretable machine learning method for analyzing high-dimensional communication data. By transforming complex datasets into simpler components, tensor decomposition reveals the nuanced relationships among linguistic features. Using a labeled spam review dataset as an illustrative example, this study demonstrates how the proposed approach uncovers patterns overlooked by traditional methods and enhances insights into language use. This framework bridges the gap between accuracy and explainability, offering a robust tool for future social science communication research.
Machine learning (ML) has become increasingly influential to human society, yet the primary advancements and applications of ML are driven by research in only a few computational disciplines. Even applications that affect or analyze human behaviors and social structures are often developed with limited input from experts outside of computational fields. Social scientists—experts trained to examine and explain the complexity of human behavior and interactions in the world—have considerable expertise to contribute to the development of ML applications for human-generated data, and their analytic practices could benefit from more human-centered ML methods. Although a few researchers have highlighted some gaps between ML and social sciences [51, 57, 70], most discussions only focus on quantitative methods. Yet many social science disciplines rely heavily on qualitative methods to distill patterns that are challenging to discover through quantitative data. One common analysis method for qualitative data is qualitative coding. In this article, we highlight three challenges of applying ML to qualitative coding. Additionally, we utilize our experience of designing a visual analytics tool for collaborative qualitative coding to demonstrate the potential in using ML to support qualitative coding by shifting the focus to identifying ambiguity. We illustrate dimensions of ambiguity and discuss the relationship between disagreement and ambiguity. Finally, we propose three research directions to ground ML applications for social science as part of the progression toward human-centered machine learning.
The study of how cognition and society interact is a complex endeavor that demands multiple methods and tools. Yet research in social cognition has only begun to capitalize on unsupervised machine learning (UML) tools that can uncover hidden patterns in data. In this tutorial, we introduce UML as a complementary approach to traditional statistical methods. We illustrate four methods (K-means clustering, Density-Based Spatial Clustering of Applications with Noise [DBSCAN], Principal Component Analysis [PCA], and Market Basket Analysis) applied to data from Project Implicit and the Implicit Association Test. We show how UML can identify patterns and relationships that conventional methods might overlook. Throughout, we provide clear (and openly available) code and highlight important researcher decision points in implementing UML in social cognition work. By bringing the advances of UML into social cognition, we will be better equipped to tackle larger, more diverse, or multilevel data sets that reveal the complexities of our social world.
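Of the four methods the tutorial covers, K-means is the simplest to sketch. Below is a minimal Lloyd's-algorithm implementation on invented 2-D points; it uses a deterministic initialization from the first k points for reproducibility, whereas practical analyses would use k-means++ or multiple random restarts:

```python
def kmeans(points, k, iters=20):
    """Plain Lloyd's algorithm on 2-D points (didactic sketch).

    Deterministic init from the first k points; returns (centers, labels).
    """
    centers = list(points[:k])

    def d2(p, c):  # squared Euclidean distance
        return (p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2

    for _ in range(iters):
        # Assignment step: each point joins its nearest center's cluster.
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k), key=lambda j: d2(p, centers[j]))].append(p)
        # Update step: recompute each center as its cluster mean.
        centers = [
            (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
            if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    labels = [min(range(k), key=lambda j: d2(p, centers[j])) for p in points]
    return centers, labels
```

The researcher decision points the tutorial emphasizes (choice of k, feature scaling, initialization) all sit outside this core loop.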
Computational power and big data have created new opportunities to explore and understand the social world. A special synergy is possible when social scientists combine human attention to certain aspects of the problem with the power of algorithms to automate other aspects of the problem. We review selected exemplary applications where machine learning amplifies researcher coding, summarizes complex data, relaxes statistical assumptions, and targets researcher attention to further social science research. We aim to reduce perceived barriers to machine learning by summarizing several fundamental building blocks and their grounding in classical statistics. We present a few guiding principles and promising approaches where we see particular potential for machine learning to transform social science inquiry. We conclude that machine learning tools are increasingly accessible, worthy of attention, and ready to yield new discoveries for social research.
At its most fundamental, "social science is the process of creating generalizable knowledge that explains or predicts societal patterns" (p. 264). Text as Data: A New Framework for Machine Learning and the Social Sciences seeks to provide readers with a model to do just this, but with a relatively untapped form of data, at least for the social sciences. Using text as data happens frequently in the computer science world, and Justin Grimmer, Margaret E. Roberts, and Brandon M. Stewart, the authors of this text, seek to extend known computer science methodology to align with social science methodological principles. The authors bridge this gap by applying our methodological models (some of them, at least) to this novel, time-relevant, and expanding form of data. This is an ambitious text that, at different stages, provides critical insight for undergraduates, graduate students across the social sciences, and practitioners. Text as Data systematically walks readers through the research process, from selection and representation to discovery to measurement and, finally, to inference and prediction. In the first section of the text, they concisely detail this model of research and the justifications behind it for the more novice scholars. The text then introduces each stage of this research process, laying out the assumptions and best practices informing this specific approach with text as data. Common to all of these introductory chapters is the emphasis on the crucial role of the human researcher. The authors do not shy away from a common fear in analyses with "big data," that human work is becoming obsolete and theory is disappearing.
Instead, they make a compelling case that although the analytic processes necessitated by "big data" may seem (and sometimes even be named) as if computers are operating independently of theory and of humans, the social science project will only succeed with the continued and constant engagement of the human-generated ideas behind the projects. Following each of these introductory chapters that adeptly frame the overall endeavor and lay out the novel application of research methods to text data, the authors present a thorough overview of the many ways in which practitioners can pursue research with text data. Here, the authors present work that has already been done in the social sciences (e.g., authorship of the Federalist papers, identifying a model of Congressional ideology from press releases, authorship and tone of tweets from former President Trump) and also work through one or more basic algorithms to link the reader to the algebraic and mathematical progressions that provide the foundation for machine learning (or other similarly opaque procedures). Concluding these detailed presentations of possible steps through the research process, the text progresses to the next step in the research process (i.e., from measurement to inference), clearly linking and overlapping these processes where appropriate. Often methodological training in the social sciences bends in the direction of either inductive or deductive research. Researchers seek, often going to extreme measures, to justify their conceptualization, operationalization, modeling, and interpretation choices prior to embarking on analytical procedures in order to avoid questions of over-fitting, p-hacking, and the like. Alternatively, researchers embark on scholarly pursuits to build theory emerging from their research sites and informants, often utilizing only qualitative techniques to do so.
Especially in elementary methodological training, these two tracks are distinct and, sometimes, juxtaposed as opposites. Not so in this text, where the authors use the emergent and exciting field of text data to emphasize the importance of iterative and sequential scholarship. The authors showcase across these four stages of the research process the opportunities for building a comprehensive research agenda that celebrates multiple approaches.
What attitudes, values, and beliefs serve as key markers of cultural change? To answer this question, we examined 221,485 respondents from the World Values Survey, a multiwave cross-country survey of people's attitudes, values, and beliefs. We trained a machine learning model to classify respondents into seven waves (i.e., periods). Once trained, the machine learning model identified the wave of a separate group of 24,611 respondents with a balanced accuracy of 77%. We then queried the model to identify the attitudes, values, and beliefs that contributed the most to its classification decisions and, therefore, served as markers of cultural change. These included religiosity, social attitudes, political attitudes, independence, life satisfaction, Protestant work ethic, and prosociality. Although past research in cultural change has discussed decreasing religiosity and increasing liberalism and independence, it has not yet identified Protestant work ethic, political orientation, and prosociality as values relevant to cultural change. Thus, the current research points to new directions for future research on cultural change that might not be evident from either a deductive or an inductive approach. This research illustrates that the abductive approach of machine learning, which focuses on the most likely explanations for an outcome, can help generate novel insights.
Background Among racial and ethnic minority groups, the risk of HIV infection is an ongoing public health challenge. Pre-exposure prophylaxis (PrEP) is highly effective for preventing HIV when taken as prescribed. However, there is a need to understand the experiences, attitudes, and barriers of PrEP for racial and ethnic minority populations and sexual minority groups. Objective This infodemiology study aimed to leverage big data and unsupervised machine learning to identify, characterize, and elucidate experiences and attitudes regarding perceived barriers associated with the uptake and adherence to PrEP therapy. This study also specifically examined shared experiences from racial or ethnic populations and sexual minority groups. Methods The study used data mining approaches to collect posts from popular social media platforms such as Twitter, YouTube, Tumblr, Instagram, and Reddit. Posts were selected by filtering for keywords associated with PrEP, HIV, and approved PrEP therapies. We analyzed data using unsupervised machine learning, followed by manual annotation using a deductive coding approach to characterize PrEP and other HIV prevention–related themes discussed by users. Results We collected 522,430 posts over a 60-day period, including 408,637 (78.22%) tweets, 13,768 (2.63%) YouTube comments, 8728 (1.67%) Tumblr posts, 88,177 (16.88%) Instagram posts, and 3120 (0.6%) Reddit posts. After applying unsupervised machine learning and content analysis, 785 posts were identified that specifically related to barriers to PrEP, and they were grouped into three major thematic domains: provider level (13/785, 1.7%), patient level (570/785, 72.6%), and community level (166/785, 21.1%). 
The main barriers identified in these categories included those associated with knowledge (lack of knowledge about PrEP), access issues (lack of insurance coverage, no prescription, and impact of the COVID-19 pandemic), and adherence (subjective reasons for why users terminated PrEP or decided not to start PrEP, such as side effects, alternative HIV prevention measures, and social stigma). Among the 785 PrEP posts, we identified 320 (40.8%) posts where users self-identified as a racial or ethnic minority or as a sexual minority group with their specific PrEP barriers and concerns. Conclusions Both objective and subjective reasons were identified as barriers reported by social media users when initiating, accessing, and adhering to PrEP. Though ample evidence supports PrEP as an effective HIV prevention strategy, user-generated posts nevertheless provide insights into what barriers are preventing broader adoption of PrEP, including topics specific to two groups: sexual minority groups and racial and ethnic minority populations. Results have the potential to inform future health promotion and regulatory science approaches that can reach these HIV and AIDS communities that may benefit from PrEP.
Background Prior research documents that India has the greatest number of girls married as minors of any nation in the world, increasing social and health risks for both these young wives and their children. While the prevalence of child marriage has declined in the nation, more work is needed to accelerate this decline and mitigate the negative consequences of the practice. Expanded targets for intervention require greater identification of these targets. Machine learning can offer insight into identification of novel factors associated with child marriage that can serve as targets for intervention. Methods We applied machine learning methods to retrospective cross-sectional survey data from India on demographics and health, the nationally representative National Family Health Survey, conducted in 2015-16. We analyzed data using a traditional regression model, with child marriage as the dependent variable, and 4000+ variables from the survey as the independent variables. We also used three commonly used machine learning algorithms: Least Absolute Shrinkage and Selection Operator (lasso) or L1-regularized logistic regression models; L2-regularized logistic regression or ridge models; and neural network models. Finally, we developed and applied a novel and rigorous approach involving expert qualitative review and coding of variables generated from an iterative series of regularized models to assess thematically key variable groupings associated with child marriage. Findings Analyses revealed that regularized logistic and neural network applications demonstrated better accuracy and lower error rates than traditional logistic regression, with a greater number of features and variables generated. Regularized models highlight higher fertility and contraception, longer duration of marriage, and geographic and socioeconomic vulnerabilities as key correlates, findings consistent with prior research.
However, our novel method involving expert qualitative coding of variables generated from iterative regularized models and resultant thematic generation offered clarity on variables not focused upon in prior research, specifically non-utilization of health system benefits related to nutrition for mothers and infants. Interpretation Machine learning appears to be a valid means of identifying key correlates of child marriage in India and, via our innovative iterative thematic approach, can be useful to identify novel variables associated with this outcome. Findings related to low nutritional service uptake also demonstrate the need for more focus on public health outreach for nutritional programs tailored to this population.
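The L2-regularized (ridge) logistic models the study iterates over can be sketched with plain batch gradient descent. This toy version uses a single invented numeric feature rather than the survey's 4,000+ variables, leaves the bias unregularized as is conventional, and uses illustrative hyperparameters:

```python
import math


def ridge_logistic(X, y, lam=0.1, lr=0.1, epochs=1000):
    """L2-regularized logistic regression via batch gradient descent (toy sketch).

    X: rows of numeric features; y: 0/1 labels; lam: ridge penalty strength.
    Returns (weights, bias).
    """
    n, p = len(X), len(X[0])
    w = [0.0] * p
    b = 0.0
    for _ in range(epochs):
        gw = [lam * wj for wj in w]  # gradient of the (lam/2)*||w||^2 penalty
        gb = 0.0
        for xi, yi in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, xi))
            pred = 1.0 / (1.0 + math.exp(-z))  # sigmoid
            err = pred - yi
            for j in range(p):
                gw[j] += err * xi[j] / n
            gb += err / n
        w = [wj - lr * gj for wj, gj in zip(w, gw)]
        b -= lr * gb
    return w, b
```

The lasso (L1) variant the study also uses differs only in the penalty term, which drives many coefficients exactly to zero and thereby performs the variable selection that feeds the expert coding step.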
This article proposes an automated methodology for the analysis of online political discourse. Drawing from the discourse quality index (DQI) by Steenbergen et al., it applies a machine learning–based quantitative approach to measuring the discourse quality of political discussions online. The DelibAnalysis framework aims to provide an accessible, replicable methodology for the measurement of discourse quality that is both platform and language agnostic. The framework uses a simplified version of the DQI to train a classifier, which can then be used to predict the discourse quality of any non-coded comment in a given political discussion online. The objective of this research is to provide a systematic framework for the automated discourse quality analysis of large datasets and, in applying this framework, to yield insight into the structure and features of political discussions online.
Sociolinguistics researchers can use sociolinguistic auto-coding (SLAC) to predict humans’ hand-codes of sociolinguistic data. While auto-coding promises opportunities for greater efficiency, like other computational methods there are inherent concerns about this method’s fairness – whether it generates equally valid predictions for different speaker groups. Unfairness would be problematic for sociolinguistic work given the central importance of correlating speaker groups to differences in variable usage. The current study examines SLAC fairness through the lens of gender fairness in auto-coding Southland New Zealand English non-prevocalic /r/. First, given that there are multiple, mutually incompatible definitions of machine learning fairness, I argue that fairness for SLAC is best captured by two definitions (overall accuracy equality and class accuracy equality) corresponding to three fairness metrics. Second, I empirically assess the extent to which SLAC is prone to unfairness; I find that a specific auto-coder described in previous literature performed poorly on all three fairness metrics. Third, to remedy these imbalances, I tested unfairness mitigation strategies on the same data; I find several strategies that reduced unfairness to virtually zero. I close by discussing what SLAC fairness means not just for auto-coding, but more broadly for how we conceptualize variation as an object of study.
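The two fairness definitions the study argues for, overall accuracy equality and class accuracy equality, both reduce to comparing accuracies across speaker groups. A stdlib sketch of that comparison (the group labels and codes below are invented, not the Southland /r/ data, and this is a generic illustration rather than the paper's exact metrics):

```python
from collections import defaultdict


def group_accuracies(records):
    """records: (group, true_code, predicted_code) triples.

    Returns (overall accuracy per group, per-class accuracy per (group, class)).
    Overall accuracy equality asks the first dict's values to match across
    groups; class accuracy equality asks the same of the second, class by class.
    """
    hit, tot = defaultdict(int), defaultdict(int)
    chit, ctot = defaultdict(int), defaultdict(int)
    for g, yt, yp in records:
        tot[g] += 1
        hit[g] += (yt == yp)
        ctot[(g, yt)] += 1
        chit[(g, yt)] += (yt == yp)
    overall = {g: hit[g] / tot[g] for g in tot}
    per_class = {k: chit[k] / ctot[k] for k in ctot}
    return overall, per_class
```

Identical overall accuracies can hide large per-class gaps, which is why the study treats the two definitions as distinct requirements.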
No abstract available
Organizational factors, as literature indicates, are significant contributors to risk in high-consequence industries. Therefore, building a theoretical framework equipped with reliable modeling techniques and data analytics to quantify the influence of organizational performance on risk scenarios is important for improving realism in Probabilistic Risk Assessment (PRA). The Socio-Technical Risk Analysis (SoTeRiA) framework theoretically connects the structural (e.g., safety practices) and behavioral (e.g., safety culture) aspects of an organization with PRA. An Integrated PRA (I-PRA) methodological framework is introduced to operationalize SoTeRiA in order to quantify the incorporation of underlying organizational failure mechanisms into risk scenarios. This research focuses on the Data-Theoretic module of I-PRA, which has two sub-modules: (i) DT-BASE: developing detailed causal relationships in SoTeRiA, grounded on theories and equipped with a semi-automated baseline quantification utilizing information extracted from academic articles, industry procedures, and regulatory standards, and (ii) DT-SITE: conducting automated data extraction and inference methods to quantify SoTeRiA causal elements based on site-specific event databases and by Bayesian updating of the DT-BASE baseline quantification. A case study demonstrates the quantification of a nuclear power plant's organizational “training” causal model, which is associated with the training/experience in Human Reliability Analysis, along with a sensitivity analysis to identify critical factors.
In this paper, we present our research using intersectionality as a standpoint for the theoretical and methodological evaluation and design of a human-centered, interactive, web-based, narrative-focused prototype. This prototype system functions as a participatory toolkit through which marginalized users can submit, annotate, and explore personal stories of discriminatory algorithmic encounters. Existing traditional fairness models tend to treat algorithmic bias as a technical defect, something to be corrected by recalibrating datasets or adjusting model weights. However, such approaches often ignore the deeper structural, systemic, sociohistorical roots of inequality and the experiences of those most harmed by technological systems. Our research challenges these reductive paradigms by operationalizing intersectionality as a new methodological framework for bias evaluation and equity-grounded design. By incorporating the lived experiences of algorithmic violence toward Black women and others existing at the intersection of multiple identities into computing workflows, this study aims to shift the paradigms of traditional algorithmic system design and epistemological computing research frameworks, and to design and develop a participatory, narrative-based system that relies on the four strategies of intersectionality methodology. This scholarship employs participatory narrative data collection, reflexive analytics, hybrid integration of natural language processing technology, researcher positionality and reflexive logging, data annotation, and feminist data visualizations. Intersectional ethics of contextualized research frameworks, integrated methodological pluralism, and embedded reflexive documentation are the main contributions of this work. This dissertation not only contributes a functional research tool but also bridges the gap between critical theory and the computation of bias-aware systems.
No abstract available
Qualitative analysis is critical to understanding human datasets in many social science disciplines. A central method in this process is inductive coding, where researchers identify and interpret codes directly from the datasets themselves. Yet, this exploratory approach poses challenges for meeting methodological expectations (such as "depth" and "variation"), especially as researchers increasingly adopt Generative AI (GAI) for support. Ground-truth-based metrics are insufficient because they contradict the exploratory nature of inductive coding, while manual evaluation can be labor-intensive. This paper presents a theory-informed computational method for measuring inductive coding results from humans and GAI. Our method first merges individual codebooks using an LLM-enriched algorithm. It measures each coder's contribution against the merged result using four novel metrics: Coverage, Overlap, Novelty, and Divergence. Through two experiments on a human-coded online conversation dataset, we 1) reveal the merging algorithm's impact on metrics; 2) validate the metrics' stability and robustness across multiple runs and different LLMs; and 3) showcase the metrics' ability to diagnose coding issues, such as excessive or irrelevant (hallucinated) codes. Our work provides a reliable pathway for ensuring methodological rigor in human-AI qualitative analysis.
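The intuition behind measuring a coder against a merged codebook can be sketched with plain set arithmetic. The functions below are simple overlap proxies for Coverage and Novelty only, and they are not the paper's method: the actual metrics operate on codebooks merged by an LLM-enriched algorithm, not on exact-match sets.

```python
def codebook_metrics(coder_codes, merged_codes):
    """Toy set-based stand-ins for two codebook-contribution metrics.

    coverage: share of the merged codebook this coder's codes hit.
    novelty: share of this coder's codes absent from the merged result.
    """
    coder, merged = set(coder_codes), set(merged_codes)
    coverage = len(coder & merged) / len(merged)
    novelty = len(coder - merged) / len(coder)
    return coverage, novelty
```

A high novelty score on this toy version could mean either genuinely new codes or hallucinated ones, which is exactly the diagnostic ambiguity the paper's richer metrics are designed to resolve.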
Moral values play a fundamental role in how we evaluate information, make decisions, and form judgements around important social issues. Controversial topics, including vaccination, abortion, racism, and sexual orientation, often elicit opinions and attitudes that are not solely based on evidence but rather reflect moral worldviews. Recent advances in Natural Language Processing (NLP) show that moral values can be gauged in human-generated textual content. Building on the Moral Foundations Theory (MFT), this paper introduces MoralBERT, a range of language representation models fine-tuned to capture moral sentiment in social discourse. We describe a framework for both aggregated and domain-adversarial training on multiple heterogeneous MFT human-annotated datasets sourced from Twitter (now X), Reddit, and Facebook that broaden textual content diversity in terms of social media audience interests, content presentation and style, and spreading patterns. We show that the proposed framework achieves an average F1 score that is between 11% and 32% higher than lexicon-based approaches, Word2Vec embeddings, and zero-shot classification with large language models such as GPT-4 for in-domain inference. Domain-adversarial training yields better out-of-domain predictions than aggregate training while achieving comparable performance to zero-shot learning. Our approach contributes to annotation-free and effective morality learning, and provides useful insights towards a more comprehensive understanding of moral narratives in controversial social debates using NLP.
We propose a new methodology for analysing hostile narratives by incorporating theories from Social Science into a Natural Language Processing (NLP) pipeline. Drawing upon Peace Research, we use the “Self-Other gradient” from the theory of cultural violence to develop a framework and methodology for analysing hostile narratives. As test data for this development, we contrast Hitler’s Mein Kampf and texts from the “War on Terror” era with non-violent speeches from Martin Luther King. Our experiments with this dataset question the explanatory value of numerical outputs generated by quantitative methods in NLP. In response, we draw upon narrative analysis techniques for the technical development of our pipeline. We experimentally show how analysing narrative clauses has the potential to generate outputs of improved explanatory value relative to purely quantitative methods. To the best of our knowledge, this work constitutes the first attempt to incorporate cultural violence into an NLP pipeline for the analysis of hostile narratives.
Purpose: The present study is about generating metadata to enhance thematic transparency and facilitate research on interview collections at the Research Documentation Centre, Centre for Social Sciences (TK KDK) in Budapest. It explores the use of artificial intelligence (AI) in producing, managing and processing social science data and its potential to generate useful metadata to describe the contents of such archives on a large scale. Design/methodology/approach: The authors combined manual and automated/semi-automated methods of metadata development and curation. The authors developed a suitable domain-oriented taxonomy to classify a large text corpus of semi-structured interviews. To this end, the authors adapted the European Language Social Science Thesaurus (ELSST) to produce a concise, hierarchical structure of topics relevant in the social sciences. The authors identified and tested the most promising natural language processing (NLP) tools supporting the Hungarian language. The results of manual and machine coding will be presented in a user interface. Findings: The study describes how an international social scientific taxonomy can be adapted to a specific local setting and tailored to be used by automated NLP tools. The authors show the potential and limitations of existing and new NLP methods for thematic assignment. The current possibilities of multi-label classification in social scientific metadata assignment are discussed, i.e. the problem of automated selection of relevant labels from a large pool. Originality/value: Interview materials have not yet been used for building manually annotated training datasets for automated indexing of scientifically relevant topics in a data repository. Comparing various automated-indexing methods, this study shows a possible implementation of a researcher tool supporting custom visualizations and the faceted search of interview collections.
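The label-selection problem the authors highlight — choosing relevant labels from a large pool — can be sketched as a thresholded ranking with a top-k fallback. The topic names, scores, and cutoffs below are hypothetical, not taken from the study.

```python
# Sketch of the label-selection step in multi-label indexing: given
# per-topic confidence scores from some upstream classifier (hard-coded
# here), keep every label above a threshold, falling back to the top-k
# so that no document is left unindexed for faceted search.
def select_labels(scores: dict, threshold: float = 0.5, top_k: int = 2) -> list:
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    chosen = [label for label, s in ranked if s >= threshold]
    if not chosen:  # fallback: always assign something
        chosen = [label for label, _ in ranked[:top_k]]
    return chosen

scores = {"migration": 0.81, "labour market": 0.55, "family": 0.22, "housing": 0.07}
labels = select_labels(scores)
```

The threshold/fallback trade-off is exactly the tension the abstract describes: a strict threshold improves precision but risks leaving interviews without any index terms.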
Exposure to information sources of different types and modalities, such as social media, movies, scholarly reports, and interactions with other communities and groups can change a person's values as well as their knowledge and attitude towards various social phenomena. My doctoral research aims to analyze the effect of these stimuli on people and groups by applying mixed-method approaches that include techniques from natural language processing, close reading, and machine learning. The research leverages different types of user-generated texts (i.e., social media and customer reviews), and professionally-generated texts (i.e., scholarly publications and organizational documents) to study (1) the impact of information that aims to advance social good for individuals and society, and (2) the impact of social and individual biases on people's language use. This work contributes to advancing knowledge, theory and computational solutions relevant to the field of computational social science. The approaches and insights discussed can provide a better understanding of people's attitudes and judgments toward issues and events of general interest, which is necessary to develop solutions for minimizing biases, filter bubbles, and polarization while also improving the effectiveness of interpersonal and societal discourse.
Sociology of values: experience of building a taxonomy by using natural language analysis technology
Modern research in the field of sociology of science is becoming more complicated due to the constantly growing publication activity of authors. To track trends in sectoral sociology, scientists turn to scientometric methods, but these alone are not enough. Trends in the development of the sociology of values as a branch of sociology are the subject of the study. The purpose of the work is to assess the possibilities of using natural language analysis methods (NLP/NLA) for thematic and theoretical clustering of research in the sociology of values. The study combined quantitative and qualitative approaches and was carried out in two stages. In the first stage, 121 abstracts of scientific articles were analyzed using text mining, after which the full set was divided into clusters. In the second stage, the results of machine clustering were examined using qualitative text analysis, on the basis of which the limitations and capabilities of the NLP/NLA method for clustering scientific texts were identified. It was found that articles with a more conservative core of theoretical categories (gender studies, migration studies, the theory of globalism) are more amenable to clustering, while theories with a loosely structured and fluid theoretical core (theories using environmental terminology, theories of inequality) are much less amenable to explicit clustering. The results obtained open a new direction for working with large arrays of scientific texts, associated with clustering them using NLP/NLA. Building clusters enables researchers to work with all texts in a given subject area, and not just with the most cited ones. This, in turn, provides visibility for all scientific ideas, including those that have not gained popularity or notability.
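The first, quantitative stage — vectorizing abstracts and grouping them by similarity — can be sketched in pure Python with TF-IDF vectors and greedy cosine-similarity grouping. The three toy "abstracts" and the similarity threshold are illustrative; the study's actual clustering algorithm is not specified in the abstract.

```python
# TF-IDF + greedy cosine clustering: a minimal stand-in for the
# text-mining stage described above.
import math
from collections import Counter

docs = [
    "gender migration inequality values study",
    "gender migration women values study",
    "environmental climate discourse terminology",
]

def tfidf(corpus):
    """One sparse weight dict per document: tf * log(N / df)."""
    tokenized = [doc.split() for doc in corpus]
    df = Counter(t for doc in tokenized for t in set(doc))
    n = len(tokenized)
    return [
        {t: (c / len(doc)) * math.log(n / df[t]) for t, c in Counter(doc).items()}
        for doc in tokenized
    ]

def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def greedy_cluster(vectors, threshold=0.2):
    clusters = []  # each cluster is a list of document indices
    for i, vec in enumerate(vectors):
        for cluster in clusters:
            # compare against the cluster's first member as representative
            if cosine(vec, vectors[cluster[0]]) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters

clusters = greedy_cluster(tfidf(docs))
```

The second, qualitative stage then inspects each machine-produced cluster by hand — which is where the study found that loosely structured theoretical cores resist clean separation.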
The U.S. has the highest rate of firearm-related deaths when compared to other industrialized countries. Violence particularly affects low-income, urban neighborhoods in cities like Chicago, which saw a 40% increase in firearm violence from 2014 to 2015, to more than 3,000 shooting victims. While recent studies have found that urban, gang-involved individuals curate a unique and complex communication style within and between social media platforms, organizations focused on reducing gang violence are struggling to keep up with the growing complexity of social media platforms and the sheer volume of data they present. In this paper, we describe the Digital Urban Violence Analysis Approach (DUVVA), a qualitative analysis method developed in a collaboration between data scientists and social work researchers to build a suite of systems for decoding the high-stress language of urban, gang-involved youth. Our approach leverages principles of grounded theory when analyzing approximately 800 tweets posted by Chicago gang members, with participation of youth from Chicago neighborhoods, to create a language resource for natural language processing (NLP) methods. In uncovering this unique language and communication style, we developed automated tools with the potential to detect aggressive language on social media and aid individuals and groups in performing violence prevention and interruption.
The escalating frequency and complexity of natural disasters highlight the urgent need for deeper insights into how individuals and communities perceive and respond to risk information. Yet, conventional research methods—such as surveys, laboratory experiments, and field observations—often struggle with limited sample sizes, external validity concerns, and difficulties in controlling for confounding variables. These constraints hinder our ability to develop comprehensive models that capture the dynamic, context-sensitive nature of disaster decision-making. To address these challenges, we present a novel multi-stage simulation framework that integrates Large Language Model (LLM)-driven social–cognitive agents with well-established theoretical perspectives from psychology, sociology, and decision science. This framework enables the simulation of three critical phases—information perception, cognitive processing, and decision-making—providing a granular analysis of how demographic attributes, situational factors, and social influences interact to shape behavior under uncertain and evolving disaster conditions. A case study focusing on pre-disaster preventive measures demonstrates its effectiveness. By aligning agent demographics with real-world survey data across 5864 simulated scenarios, we reveal nuanced behavioral patterns closely mirroring human responses, underscoring the potential to overcome longstanding methodological limitations and offer improved ecological validity and flexibility to explore diverse disaster environments and policy interventions. While acknowledging the current constraints, such as the need for enhanced emotional modeling and multimodal inputs, our framework lays a foundation for more nuanced, empirically grounded analyses of risk perception and response patterns. By seamlessly blending theory, advanced LLM capabilities, and empirical alignment strategies, this research not only advances the state of computational social simulation but also provides valuable guidance for developing more context-sensitive and targeted disaster management strategies.
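The three-phase pipeline (information perception → cognitive processing → decision-making) can be sketched structurally. Here a rule-based `appraise` function stands in for the LLM call, and all attribute names and thresholds are invented for illustration, not drawn from the paper.

```python
# Structural sketch of a three-phase social-cognitive agent:
# perceive -> appraise -> decide. Rule-based logic replaces the LLM.
from dataclasses import dataclass

@dataclass
class Agent:
    age: int
    prior_experience: bool       # has lived through a disaster before
    trusts_official_sources: bool

def perceive(agent: Agent, warning: dict) -> dict:
    """Phase 1: filter the warning through the agent's information habits."""
    credibility = 0.9 if agent.trusts_official_sources else 0.5
    return {"severity": warning["severity"], "credibility": credibility}

def appraise(agent: Agent, percept: dict) -> float:
    """Phase 2: stand-in for LLM-driven cognitive processing."""
    risk = percept["severity"] * percept["credibility"]
    if agent.prior_experience:
        risk *= 1.3  # experienced agents weight threats more heavily
    return risk

def decide(risk: float, threshold: float = 0.6) -> str:
    """Phase 3: map appraised risk onto a preventive action."""
    return "prepare" if risk >= threshold else "wait"

agent = Agent(age=54, prior_experience=True, trusts_official_sources=True)
action = decide(appraise(agent, perceive(agent, {"severity": 0.7})))
```

In the paper's framework, the appraisal step is where the LLM injects demographically conditioned reasoning; the skeleton above only fixes the interfaces between the three phases.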
The convergence of expanded data availability, technological breakthroughs, and evolving societal dynamics has rekindled scholarly interest in culture and meaning. Accompanying computational methods are advancing rapidly across natural language processing, digital image processing, machine learning, neural networks, and artificial intelligence, which creates unprecedented opportunities for cultural meaning analysis. The introduction to this special issue advances abductive theorizing and reflexive rendering to explore the dynamic interplay of theory generation, measurement, and interpretation, highlighting common themes across its five articles. Our analysis unfolds in three stages. We begin with an abductive examination of the culture and meaning theories covered in the articles, exploring how they shape data selection and preparation, as well as research designs for addressing theoretical questions. Next, we engage in reflexive rendering, scrutinizing data characteristics and representations, methodological approaches, computational methods, and the theoretical artifacts that emerge to advance theory development. Finally, we discuss possible contributions, challenges, and concerns when organizational research examines culture and meaning using computational methods.
While reinforcement learning (RL) over chains of thought has significantly advanced language models in tasks such as mathematics and coding, visual reasoning introduces added complexity by requiring models to direct visual attention, interpret perceptual inputs, and ground abstract reasoning in spatial evidence. We introduce ViGoRL (Visually Grounded Reinforcement Learning), a vision-language model trained with RL to explicitly anchor each reasoning step to specific visual coordinates. Inspired by human visual decision-making, ViGoRL learns to produce spatially grounded reasoning traces, guiding visual attention to task-relevant regions at each step. When fine-grained exploration is required, our novel multi-turn RL framework enables the model to dynamically zoom into predicted coordinates as reasoning unfolds. Across a diverse set of visual reasoning benchmarks--including SAT-2 and BLINK for spatial reasoning, V*bench for visual search, and ScreenSpot and VisualWebArena for web-based grounding--ViGoRL consistently outperforms both supervised fine-tuning and conventional RL baselines that lack explicit grounding mechanisms. Incorporating multi-turn RL with zoomed-in visual feedback significantly improves ViGoRL's performance on localizing small GUI elements and visual search, achieving 86.4% on V*Bench. Additionally, we find that grounding amplifies other visual behaviors such as region exploration, grounded subgoal setting, and visual verification. Finally, human evaluations show that the model's visual references are not only spatially accurate but also helpful for understanding model reasoning steps. Our results show that visually grounded RL is a strong paradigm for imbuing models with general-purpose visual reasoning.
The ML-Schema, proposed by the W3C Machine Learning Schema Community Group, is a top-level ontology that provides a set of classes, properties, and restrictions for representing and interchanging information on machine learning algorithms, datasets, and experiments. It can be easily extended and specialized, and it is also mapped to other, more domain-specific ontologies developed in the area of machine learning and data mining. In this paper we overview existing state-of-the-art machine learning interchange formats and present the first release of ML-Schema, a canonical format resulting from more than seven years of experience among different research institutions. We argue that exposing the semantics of machine learning algorithms, models, and experiments through a canonical format may pave the way to better interpretability and to realistically achieving full interoperability of experiments regardless of platform or adopted workflow solution.
In the quest to align deep learning with the sciences to address calls for rigor, safety, and interpretability in machine learning systems, this contribution identifies key missing pieces: the stages of hypothesis formulation and testing, as well as statistical and systematic uncertainty estimation -- core tenets of the scientific method. This position paper discusses the ways in which contemporary science is conducted in other domains and identifies potentially useful practices. We present a case study from physics and describe how this field has promoted rigor through specific methodological practices, and provide recommendations on how machine learning researchers can adopt these practices into the research ecosystem. We argue that both domain-driven experiments and application-agnostic questions of the inner workings of fundamental building blocks of machine learning models ought to be examined with the tools of the scientific method, to ensure we not only understand effect, but also begin to understand cause, which is the raison d'être of science.
Intimate relationship stability is fundamental to human wellbeing, yet its quantitative assessment faces dual challenges: the inherent subjectivity of psychological constructs and the complexity of social ecosystems. Symmetry, as a fundamental structural feature of social interaction, plays a pivotal role in shaping relational dynamics. To address these limitations, this study proposes an innovative computational framework that integrates Fuzzy Set Theory with Social Network Analysis (SNA). The framework consists of two complementary components: (1) a psychologically grounded fuzzy assessment model that employs differentiated membership functions to transform discrete subjective ratings into continuous and interpretable relationship quality indices and (2) an enhanced Fuzzy C-Means (FCM) threat detection model that utilizes Weighted Mahalanobis Distance to accurately identify and cluster potential interference sources within social networks. Empirical validation using a simulated dataset—comprising typical characteristic samples from 10 couples—demonstrates that the proposed framework not only generates interpretable relationship diagnostics by correcting biases associated with traditional averaging methods, but also achieves high precision in threat identification. The results indicate that stable relationships exhibit greater symmetry in partner interactions, whereas threatened nodes display structural and behavioural asymmetry. This study establishes a rigorous mathematical paradigm—“Subjective Fuzzification → Multidimensional Feature Engineering → Intelligent Clustering”—for relationship science, thereby advancing the field from descriptive analysis toward data-driven, quantitative evaluation and laying a foundation for systematic assessment of relational health.
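A minimal Fuzzy C-Means implementation illustrates the clustering component of the framework above. Standard Euclidean distance is used where the paper substitutes a Weighted Mahalanobis Distance, and the toy 2-D points are not the couples dataset from the study.

```python
# Minimal Fuzzy C-Means: soft memberships u[k][i] of point k in cluster i.
import math
import random

def dist(a, b):
    """Euclidean distance; the paper's variant would weight dimensions."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b))) or 1e-12

def fcm(points, c=2, m=2.0, iters=50, seed=0):
    random.seed(seed)
    # random initial membership matrix; each row sums to 1
    u = []
    for _ in points:
        row = [random.random() for _ in range(c)]
        s = sum(row)
        u.append([v / s for v in row])
    for _ in range(iters):
        # update centers: (u^m)-weighted mean of the points
        centers = []
        for i in range(c):
            w = [u[k][i] ** m for k in range(len(points))]
            total = sum(w)
            centers.append(tuple(
                sum(w[k] * points[k][d] for k in range(len(points))) / total
                for d in range(len(points[0]))
            ))
        # update memberships: u_ki = 1 / sum_j (d_ki / d_kj)^(2/(m-1))
        for k, p in enumerate(points):
            d = [dist(p, centers[i]) for i in range(c)]
            for i in range(c):
                u[k][i] = 1.0 / sum((d[i] / d[j]) ** (2 / (m - 1)) for j in range(c))
    return centers, u

# two clearly separated toy "interaction profile" groups
points = [(0.1, 0.2), (0.2, 0.1), (0.15, 0.15), (0.9, 0.8), (0.8, 0.9), (0.85, 0.85)]
centers, memberships = fcm(points)
```

The soft memberships matter here: a "threatened" node in the paper's sense is one whose membership is split across clusters, i.e. structurally asymmetric rather than cleanly assigned.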
Comunalidad is the result of struggle and collective reflection emerging from the daily resistance and lived experiences of Indigenous peoples in the Sierra Norte of Oaxaca, Mexico (Maldonado, 2010). Like comunalidad, language reclamation is inherently relational and dynamic, deeply connected to identity, autonomy, and self-determination. In this essay, I explore comunalidad as theory, praxis, and pedagogy asking: How might comunalidad, as a relational and situated praxis, inform efforts toward language revitalization? Understanding the foundations and contextual factors that give rise to comunalidad is necessary to illuminate its intersections with and implications for language reclamation. I argue that comunalidad prompts us to conceive of language reclamation as a collective purpose—one that arises from, informs, and strengthens community relational practices and processes. As theory, comunalidad informs language reclamation; as praxis, it actively shapes both language and the process of reclaiming it. In this way, comunalidad emphasizes the need for situated pedagogies rooted in the daily praxis of the community. Overall, a lens of comunalidad provides insight into how language reclamation can function as a collective process and responsibility, building and strengthening community relationality, self-determination, and resistance.
The expression and interpretation of sentiment within language are deeply intertwined with social and cultural contexts, influencing both linguistic theory and practical applications such as sentiment analysis in computer science. This paper explores sentiment through M.A.K. Halliday’s social semiotic framework, revealing how linguistic mechanisms and social contexts shape emotional communication. By applying Halliday’s principles, we enhance understanding of how sentiment is constructed and communicated across diverse contexts, and propose improvements for sentiment analysis tools in natural language processing. Our findings demonstrate that sentiment is a socially embedded phenomenon, reflecting and shaping interpersonal relationships and cultural values, and suggest new directions for integrating socio-semiotic insights into computational models.
Recently, there has been an increase in text analysis and natural language processing for both research and applied practice, especially to quantify emotions in language (i.e. sentiment analysis). Building on different theories of how language and emotions interact and how these interactions differ by gender and race/ethnicity, our study assesses bias in the use of common sentiment analysis tools (e.g. AFINN, NRC). Specifically, we focus on measurement bias and predictive bias between genders and races/ethnicities using a novel real-world dataset of participant interviews in a simulated multi-day team-based competition. There was no evidence of measurement bias by race/ethnicity, but there were some biases by gender; specifically, females tended to express higher mean levels and more variance in emotion. There was no evidence of predictive bias by gender or race/ethnicity, though the latter was marginally significant. We hope this study paves the way towards more inclusive and accurate analytical tools to help researchers reduce demographic biases in their research. These findings also hold importance for organizations in employing equitable tools to better understand the needs of their diverse customers and employees.
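The measurement-bias check described above reduces, at its simplest, to comparing the mean and variance of sentiment scores across demographic groups. The scores below are fabricated to mirror the reported pattern (higher female mean and variance); they are not the study's data, and real AFINN/NRC scores would come from the interview transcripts.

```python
# Group-wise summary of sentiment scores, the core of a measurement-bias
# audit. Scores and group labels are invented for illustration.
from statistics import mean, variance

scores_by_group = {
    "female": [2.1, 3.4, -0.5, 4.0, 1.2, 3.8],
    "male":   [1.0, 1.4, 0.8, 1.2, 1.1, 0.9],
}

summary = {
    group: {"mean": mean(vals), "variance": variance(vals)}
    for group, vals in scores_by_group.items()
}
```

A full audit would follow this with a significance test and, for predictive bias, a check of whether the tool's scores predict outcomes equally well in each group.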
As the internal structure of cities grows increasingly complex, traditional governance models can no longer meet the demands of social governance, and digital governance has become an important approach to building the urban governance landscape. To explore the factors influencing governance effectiveness from the perspective of county-level government digital governance, this paper reviews digital governance theory and the related literature, analyzes the factors that influence government governance effectiveness from a digital governance perspective, and examines the role each element plays in building a government digital governance framework. The results show that governance effectiveness from a digital governance perspective is influenced mainly by six factors: technical capability, organizational structure, policy design, conceptual change, data chain management, and the degree of multi-stakeholder participation.
This report synthesizes the full evolution of computational grounded theory (CGT), from the construction of its theoretical framework to its interdisciplinary applications. The research has not only established a methodological logic centered on "human-machine collaboration" and "abductive reasoning", but has also achieved deep empirical grounding in areas such as social discourse analysis and the measurement of psychological constructs. At the same time, the academic community has engaged in sustained debate over the bias, fairness, and epistemological rigor of computational tools, driving a paradigm shift from purely automated coding toward a more reflexive, interpretive, and intersectional computational social science.