Workshop report: "Designing, building and using data banks of interactional corpora from a conversation analytic perspective"
CHORD talk in interaction
22/07/2024
From December 7th to 8th, 2023, researchers from Switzerland and abroad attended a hybrid workshop on “Designing, building and using data banks of interactional corpora from a conversation analytic perspective” at the University of Basel. The event was organised by Lorenza Mondada, professor of General and French Linguistics, and featured four guest speakers from the Leibniz Institute for the German Language (Leibniz-Institut für Deutsche Sprache, IDS). Arnulf Deppermann (head of the Pragmatics department), Silke Reineke (head of the Oral corpora section and of the project FOLK), Henrike Helmer (then head of the Oral corpora section), and Mark-Christoph Müller (head of the project Corpus technology for oral corpora) gave presentations on the design and analysis of large machine-readable corpora of spoken interaction, with a particular emphasis on the IDS corpus FOLK (Forschungs- und Lehrkorpus Gesprochenes Deutsch / Research and Teaching Corpus of Spoken German). In three presentations, which were complemented by several rounds of discussion with on-site and online participants, the guest speakers outlined the challenges of creating and using interactional corpora. They addressed aspects specific to the perspectives of both corpus builders and corpus users by reflecting on the peculiarities of audio/video recorded data of naturally occurring interaction, on questions of data stratification and representativity, and on the digital corpus environment developed by IDS, its search tools, functionality, and usability.
The aim of this report is to summarise the presentations and to give room to some of the questions brought up in the discussions, which reflect the concerns of the scientific communities interested in FAIR sharing and reusing of audio/video recordings and transcripts of naturally occurring interaction.
Arnulf Deppermann: Using large machine-readable corpora for Interactional Linguistics and Conversation Analysis: Potentials and limitations
Arnulf Deppermann gave his presentation on the first workshop day and structured it in four parts which were followed by three discussion rounds. He addressed the benefits and caveats of using large linguistic corpora in Interactional Linguistics (IL) and Conversation Analysis (CA), the types of studies that can be conducted with them, and the appropriate interpretation of sociolinguistic metadata.
Deppermann commenced his presentation by defining the minimal features of a talk-in-interaction corpus (also see Schmidt, 2018). He outlined some of the practical advantages of sharing research data via a digital corpus platform, spanning all stages of the research process from data collection through analysis to community reuse. The benefits include a more efficient use of temporal and financial resources, the accessibility of research data from different locations, the searchability of larger quantities of data, and avenues of cumulative research such as comparative and diachronic investigations. Deppermann went on to describe the current constitution of FOLK (version 2.20), the possible roles such a corpus can play in an IL/CA study, and the types of studies that can be conducted with it. He presented three main types of IL/CA research questions, their corpus-based starting points, advantages, and caveats: i) studies focusing on a linguistic form, ii) studies focusing on an action (sequence), and iii) studies focusing on an interaction type. To illustrate approach (i), which starts from a linguistic form or property that can be retrieved by a corpus query, Deppermann reported a recent study about the German construction "was heißt X" (‘what does X mean’; Reineke, Deppermann & Schmidt 2023). He also used this study to trace the research process when using FOLK, which he divided into the steps “search & filter”, “save & export”, “IL/CA analysis”, “retranscription”, “coding”, and “dependencies check”. The last step refers to the detection of dependencies between linguistic phenomena and variables related to speakers, events or interaction types.
Another topic that took on great importance in Deppermann’s presentation and in the ensuing discussions was metadata and their appropriate use. Deppermann highlighted the relevance of metadata descriptors for a series of tasks in IL and CA research. They help users understand the peculiarities of participants and settings and avoid overgeneralisations. They also allow users to build virtual subcorpora from a large corpus and, in a mixed methods approach, to test for correlations. However, Deppermann explained, metadata should be used with circumspection: shallow explanations that arise from easily available descriptors cannot replace a detailed analysis of interactional conduct. Also, speaker- and event-related metadata classifications are essentially etic, i.e., they are stable categories defined by analysts rather than by the members of the social community themselves. They may therefore differ substantially from emic, mutable categories that are manifestly relevant for participants in an interaction and on which CA builds its analyses.
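The idea of building a virtual subcorpus from metadata descriptors can be pictured with a minimal Python sketch. Everything in it is an assumption made for illustration: the `SpeechEvent` fields, the setting labels, and the sample values are invented and do not reproduce the DGD data model. The sketch also makes the caveat above concrete, since the filter only ever sees analyst-assigned (etic) labels.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpeechEvent:
    # All fields are hypothetical stand-ins for event metadata.
    event_id: str
    setting: str          # e.g. "private", "institutional", "public"
    has_video: bool
    duration_hours: float


def virtual_subcorpus(events, *, setting=None, video_only=False):
    """Select events whose analyst-assigned descriptors match the filters."""
    selected = []
    for event in events:
        if setting is not None and event.setting != setting:
            continue
        if video_only and not event.has_video:
            continue
        selected.append(event)
    return selected


# Invented sample data, not taken from FOLK.
events = [
    SpeechEvent("E1", "private", True, 1.5),
    SpeechEvent("E2", "institutional", False, 0.75),
    SpeechEvent("E3", "private", False, 2.0),
    SpeechEvent("E4", "public", True, 1.0),
]

private_video = virtual_subcorpus(events, setting="private", video_only=True)
print([event.event_id for event in private_video])
```

A selection of this kind narrows down where to look; whether a category such as "private" is relevant to the participants themselves still has to be shown in the analysis of the recordings.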
Triggered by Deppermann’s elaborations on metadata, some workshop participants used the discussions to express their concerns: Would it be safer not to refer to metadata at all to avoid any inadequate interpretation of the data? Which analytical and methodological skills does a researcher need to correctly interpret available metadata and take full advantage of them? Deppermann brought forward several reasons why the provision of metadata is necessary. The core of the problem, he argued, is not that metadata are provided in the first place, but that they may be used incorrectly in ways that are difficult for corpus builders to foresee and prevent. At a community level, it is therefore important to monitor the information literacy, methodological skills, and epistemological awareness of researchers.
Deppermann concluded his presentation with some reflections on the effects of using large public corpora on the development of IL/CA. He contrasted three dangers with six benefits: Potential dangers could lie in i) a bias towards form-based studies as they are easily searchable, hence narrowing down our understanding of social interaction to linguistic practices; ii) losing the inductive “analytic mentality of CA” (Schenkein 1978) in favour of quantification and standardisation of data representation, procedures, and concepts; iii) privileging research with public data over recording data in closed settings (e.g., therapy, prisons, conflictual interaction, etc.). In contrast, the use of large corpora in IL/CA could i) enhance the cumulativity, comparability, and transparency of scientific research, ii) support the participation of researchers without financial and/or institutional means, iii) enable mixed methods approaches (quantitative IL/CA), iv) uncover more diversified phenomena, v) produce more generalisable findings, and vi) enhance awareness of contextual contingencies as opposed to indulging in premature universalism.
Silke Reineke: The Research and Teaching Corpus of Spoken German (FOLK): History, workflows and challenges in building a large corpus of audio and video data of social interaction
The second workshop day began with a presentation by Silke Reineke, who laid out the aim and stratification goals of the FOLK corpus and used it as a case study to illustrate workflows within the Oral corpora section of the IDS from data acquisition to data release, and to address challenges that arise when collecting and processing interactional data.
Reineke started her presentation by providing details about FOLK’s history and its current composition (version 2.20) as a large, diversified, and annotated corpus of audio and video recorded contemporary German talk-in-interaction. The corpus project began in 2008 with the aim of building a national, balanced, qualitatively representative, and synchronous resource for the study of spoken German in naturally occurring interactions. With the IDS’s standing as the “central scientific institution for the study and documentation of the contemporary usage and recent history of the German language” (Leibniz-Institut für Deutsche Sprache, The IDS Introduces Itself), the project benefits from federal and institutional funding that suffices to develop, maintain, and expand such a resource. The targeted users of the corpus are researchers, teachers, and students from the fields of CA, IL, and neighbouring disciplines such as Sociolinguistics, Variational Linguistics, and Media Linguistics.
At the time of the workshop, approximately 11 years after its initial release in 2012, FOLK comprised 347 hours of recordings. Of these, 190 hours were audio-only and 157 hours included video. The recordings originated from a variety of settings, including private, institutional, and public contexts, and corresponded to 414 speech events. The corpus data amounted to approximately 3.3 million word tokens and included 1,317 speakers from various regions, ages, and educational backgrounds.
To build a resource with such volumes of data, a series of conceptual and theoretical decisions had to be made and practical problems had to be overcome. Reineke succinctly presented a workflow spanning from field access for researchers to data access for users. Prerequisites for data collection are access to the studied field and the consent of the recorded speakers. The researchers then enrich the primary data with metadata and process the audio (and video) recordings for further utilisation. Based on the transcription convention cGAT (Schmidt et al. 2023), minimal transcripts of the verbal interactional conduct are produced and subjected to a first quality check. Subsequently, the transcripts are orthographically normalised before undergoing a second quality check. Once they have passed this second check, the data are transferred to corpus technology, where they are automatically lemmatised and part-of-speech tagged. They are then stored and published in the Database for Spoken German (Datenbank für Gesprochenes Deutsch, DGD), where they can be browsed, queried, and partially downloaded by registered users who have an academic affiliation and pursue a scientific purpose.
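The order of the steps in this workflow, with its two quality checks acting as gates, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stage identifiers paraphrase the description above, and the `Recording` class and both helper functions are hypothetical, not part of any IDS software.

```python
from dataclasses import dataclass, field

# Stage names paraphrase the workflow described above; the identifiers
# themselves are illustrative, not IDS terminology.
STAGES = [
    "consent_obtained",
    "metadata_added",
    "media_processed",
    "minimal_transcript",
    "quality_check_1",
    "orthographic_normalisation",
    "quality_check_2",
    "lemmatisation_and_pos_tagging",
    "published_in_dgd",
]


@dataclass
class Recording:
    # Hypothetical record of how far one recording has progressed.
    recording_id: str
    completed: list = field(default_factory=list)


def next_stage(recording):
    """Return the next pending stage, or None once every stage is done."""
    if len(recording.completed) == len(STAGES):
        return None
    return STAGES[len(recording.completed)]


def complete(recording, stage):
    """Mark a stage as done, refusing any stage that is out of order."""
    expected = next_stage(recording)
    if stage != expected:
        raise ValueError(f"expected {expected!r}, got {stage!r}")
    recording.completed.append(stage)


rec = Recording("R1")
for stage in STAGES[:5]:          # everything up to the first quality check
    complete(rec, stage)
print(next_stage(rec))
```

Modelling the stages as a strict sequence reflects the point that a transcript cannot reach lemmatisation or publication without first passing both quality checks.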
Reineke also elaborated on some challenges in this workflow, including limited resources, field access, data quality (both methodologically and technically), and data sensitivity. She also mentioned user expectations and the aim to ensure the growth of the corpus. These general challenges are further compounded by the need to adequately account for language variation and to comply with changing technical, legal, and policy constraints. She explained that FOLK’s stratification goals have been revised recently (cf. Kaiser 2019) and that the revised concept is currently being implemented.
Reineke concluded her presentation with a brief outlook on the FOLK project and general developments in the fields of CA and IL. In the context of a changing academic culture that increasingly demands that research data be made digitally available for secondary use, she stressed the importance of education to guarantee an appropriate and ethically responsible use of such data and their online environments. She also advocated for the recognition of data sharing efforts as scientific contributions in the evaluation of grant proposals.
The discussion that followed Reineke’s presentation concentrated on questions of consent, data sensitivity, and access restrictions. Workshop participants inquired about advisable formulations to keep consent forms valid despite rapidly changing policies and technological developments, about the relationship between data protection and data censorship, and about the possibility of sharing personal data within the scientific community when de-identification measures, consent forms, and access restrictions are well calibrated.
Henrike Helmer & Mark-Christoph Müller: Dissemination and searchability of spoken data in corpus research tools: Benefits and limitations for CA and IL research
Following Reineke’s presentation, which provided insights into the perspective of corpus builders, Henrike Helmer and Mark-Christoph Müller delivered a talk that focused on the technical background of FOLK and the DGD, the potential applications it offers to corpus users, and ways of considering user requirements in the development of corpus research tools.
Helmer and Müller first elucidated the architecture and data model of the DGD, the database through which FOLK and the other corpora of the Archive for Spoken German (Archiv für Gesprochenes Deutsch, AGD) can be accessed. They compared the “one-size-fits-all” approach employed for the user interface of the DGD with ZuMult, an alternative corpus framework offering five prototype tools that enable target group-specific data access to FOLK and other corpora. ZuMult stands for “Zugänge zu multimodalen Korpora gesprochener Sprache” (‘access to multimodal corpora of spoken language’) and was first released in 2021. One of its tools is ZuRecht (“Zugang zur Recherche in Transkripten” / ‘search access to transcripts’), which allows users to formulate complex searches using the corpus query language CQP.
Helmer and Müller contrasted the query options of ZuRecht with the search tools of the DGD to illustrate the difference between functionality and usability, and with it, the broader topic of user requirements and of how they can be considered when implementing new tools and functions. Functionality answers the question “What can I do with a tool?”, whereas usability focuses on the ease (or difficulty) of applying a tool. The invited speakers then contrasted two basic approaches to accessing corpus data that differ in terms of functionality and usability. Corpus browsing, on the one hand, is a quite unconstrained approach, which does not require specific digital skills on the user’s side but necessitates a certain amount of time to get familiar with the data. Corpus queries with varying degrees of complexity, on the other hand, allow for a more focused way of engaging with data. They presuppose that users have a more precise research question in mind and possess the technical skills to handle the provided corpus tools. As Helmer and Müller explained, the DGD interface offers both these options and allows users to browse its data by event, transcript, speaker, and media or to perform a full-text search for linguistic forms, annotated properties, event metadata and speaker metadata. Helmer and Müller came back to the example query for "was heißt X" (‘what does X mean’), which Deppermann had referred to on the first workshop day, and showed how the DGD search could be translated into a complex query in ZuRecht. Instead of formulating a structured, multi-step token query, narrowing down the context and filtering the metadata, they used the CQP query language of ZuRecht to obtain the same results in just one step.
This example demonstrated the efficiency, for proficient users, of data access through ZuRecht. It also demonstrated a frequently observed trade-off between functionality and usability, since, as Müller explained, providing more powerful functionality may result in a reduction of general usability. The question of usability depends, in any case, on the users’ digital literacy and can therefore only ever be answered in relation to specific user groups. Knowing relevant user groups and their skills is thus a necessary prerequisite for optimising corpus hosting services. Helmer and Müller then described how the IDS learnt about, and accommodated, the needs of FOLK users in the past. In 2016, for example, a survey was conducted (cf. Fandrych et al. 2016) that investigated the use of FOLK and enquired about desired functions. Additionally, IDS personnel are in contact with users through the support service of the DGD and during training workshops. Supported by this exchange, several desiderata were recorded and addressed, e.g., by implementing functions for generating word lists, by offering the possibility to restrict a search to video recorded data or by adding new export functions for collections.
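What a one-step token query for a pattern like "was heißt X" does can be approximated with a small Python sketch. It is not ZuRecht's implementation or its exact query syntax: the attribute names `word` and `lemma`, the function `match_sequence`, and the toy transcript are all assumptions. In CQP-style notation, a pattern of this kind is typically written along the lines of `[word="was"] [lemma="heißen"] []`, where the empty brackets match any token.

```python
import re

# Toy annotated tokens as (word form, lemma) pairs. The values are invented;
# a real corpus also carries part-of-speech tags and metadata.
transcript = [
    ("was", "was"), ("heißt", "heißen"), ("teuer", "teuer"),
    ("ja", "ja"), ("was", "was"), ("hieß", "heißen"), ("das", "das"),
    ("was", "was"), ("machst", "machen"), ("du", "du"),
]


def match_sequence(tokens, pattern):
    """Return start indices where consecutive tokens satisfy the pattern.

    Each pattern element is a dict mapping an attribute name to a regular
    expression; an empty dict matches any token.
    """
    attrs = {"word": 0, "lemma": 1}
    hits = []
    for start in range(len(tokens) - len(pattern) + 1):
        window = tokens[start:start + len(pattern)]
        if all(
            re.fullmatch(regex, token[attrs[name]])
            for token, constraint in zip(window, pattern)
            for name, regex in constraint.items()
        ):
            hits.append(start)
    return hits


# "was", then any form of the lemma "heißen", then any token (the X slot).
query = [{"word": "was"}, {"lemma": "heißen"}, {}]
print(match_sequence(transcript, query))
```

Searching on the lemma rather than the word form is what lets a single pattern retrieve both "heißt" and "hieß", which is one reason why lemmatised transcripts make form-based queries efficient.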
Helmer and Müller closed their presentation by outlining the types of CA/IL phenomena that can be found by means of corpus search tools. They pointed out that form-based phenomena are easier to search for than concepts that are not preferentially associated with a limited number of forms. In that context, they hinted at the topic of collection building and at the possible role of higher order annotations and user comments to support that phase of analysis. In the discussion that followed the presentation, some workshop participants enquired about the possibility of collaborative workspaces in the DGD, and their advantages and disadvantages were addressed. While this is already planned as a future scenario for the DGD, the invited speakers pointed out the complexity of integrating user-generated input to be stored within the DGD: in addition to quality check workflows and the need to keep the origin of data transparent, such functions will also have to be evaluated with regard to legal and ethical aspects.
Nina Profazi & Johanna Miecznikowski
Presentations
Deppermann, A. (2023, December 7-8). Using large machine-readable corpora for Interactional Linguistics and Conversation Analysis: Potentials and limitations [Workshop presentation]. Designing, building and using data banks of interactional corpora from a conversation analytic perspective, Basel, Switzerland.
Helmer, H. & Müller, M.-C. (2023, December 7-8). Dissemination and searchability of spoken data in corpus research tools: Benefits and limitations for CA and IL research [Workshop presentation]. Designing, building and using data banks of interactional corpora from a conversation analytic perspective, Basel, Switzerland.
Reineke, S. (2023, December 7-8). The Research and Teaching Corpus of Spoken German (FOLK): History, workflows and challenges in building a large corpus of audio and video data of social interaction [Workshop presentation]. Designing, building and using data banks of interactional corpora from a conversation analytic perspective, Basel, Switzerland.
References
Fandrych, C., Frick, E., Hedeland, H., Iliash, A., Jettka, D., Meißner, C., Schmidt, T., Wallner, F., Weigert, K., & Westpfahl, S. (2016). User, who art thou? User profiling for oral corpus platforms. Paper presented at the 10th International Conference on Language Resources and Evaluation (LREC'16), Portorož, Slovenia. 280–287. https://aclanthology.org/L16-1043
Kaiser, J. (2019). Zur Stratifikation des FOLK-Korpus: Konzeption und Strategien. Gesprächsforschung. Online-Zeitschrift zur verbalen Interaktion, 19 (2018), 515-552. http://www.gespraechsforschung-online.de/2018.html
Leibniz-Institut für Deutsche Sprache. The IDS Introduces Itself. Retrieved June 2, 2024, from https://www.ids-mannheim.de/en/entrance-into-the-ids/
Reineke, S., Schmidt, T., & Deppermann, A. (2023). Das Forschungs- und Lehrkorpus für Gesprochenes Deutsch (FOLK). In A. Deppermann, C. Fandrych, M. Kupietz & T. Schmidt (Eds.), Korpora in der germanistischen Sprachwissenschaft: Mündlich, schriftlich, multimedial (pp. 71-102). De Gruyter.
Schenkein, J. (1978). Sketch of an analytic mentality for the study of conversational interaction. In J. Schenkein (Ed.), Studies in the organization of conversational interaction (pp. 1-6). Academic Press. https://doi.org/10.1016/B978-0-12-623550-0.50007-0
Schmidt, T. (2018). Gesprächskorpora. Aktuelle Herausforderungen für einen besonderen Korpustyp. In M. Kupietz, & T. Schmidt (Eds.), Korpuslinguistik (pp. 209-230). De Gruyter. https://doi.org/10.1515/9783110538649-010
Schmidt, T., Schütte, W., Winterscheid, J., Schürmann, M., Reineke, S., & Schedl, E. (2023). cGAT. Konventionen für das computergestützte Transkribieren in Anlehnung an das Gesprächsanalytische Transkriptionssystem 2 (GAT2). Mannheim: Leibniz-Institut für Deutsche Sprache. https://ids-pub.bsz-bw.de/frontdoor/index/index/docId/12186