Large data banks as continuous accomplishments: Developing, managing, and updating infrastructures for data sharing
CHORD talk in interaction
15/09/2024
Large data banks for spoken corpora have proved themselves as key tools within research on language and social interaction. Through their structure, metadata, and annotations they open up novel avenues to apprehend linguistic data and to initiate further research. The advantages of working with large corpora and the specific methodological implications for the fields of Interactional Linguistics (IL) and Conversation Analysis (CA) were extensively discussed during workshop 2 of the CHORD-talk-in-interaction project. What was explored less in the first months of the project are the intricate challenges posed by the development and management of data banks over time, for example regarding the organisation and annotation of corpus data and the development of search tools through which the data may be accessed. Workshop 3 of the CHORD-talk-in-interaction project offered the opportunity to delve into these complex issues. The event was organised by Simona Pekarek Doehler (University of Neuchâtel) and Jérôme Jacquin (University of Lausanne) and took place at the University of Neuchâtel on January 15 and 16, 2024, spanning two consecutive half-days. The workshop was offered in a hybrid format, allowing for both on-site and online attendance. Each half-day featured two presentations by two different speakers, followed by discussions with the participants.
The first invited speaker was Johannes Wagner (University of Southern Denmark), who is a member of the advisory board of TalkBank, “the world’s largest open-access integrated repository for spoken-language data” (MacWhinney, 2019, p. 1919). The TalkBank project has been directed by Brian MacWhinney over many decades. It emerged from an initial project named CHILDES, launched in 1984. The aim was to computerise, unite, and preserve data produced by different researchers that documented child language, thereby creating the conditions to make them available for other users. This project was supported by the US National Institutes of Health (NIH) and the US National Science Foundation (NSF) and was expanded in 2001 to host a greater diversity of data, with the ambition to document human language and communication from various perspectives (e.g., linguistics, psychology, computer science, education). Besides its scientific goal, TalkBank also emerged as a pioneering technological initiative, since it offered a comprehensive web interface that integrated a database, a transcription environment (CHAT) and an annotation programme to structure and analyse the data (CLAN). The integration of several tools provided researchers with an innovative technical environment in which various types of primary data, including audio, video, and pictures, could be systematically linked to transcripts and queried using CLAN commands. Today, TalkBank holds 14 different corpus banks, each documenting diverse facets of language use, most of which are freely accessible and downloadable. Wagner has been working with the CA-related corpora under TalkBank for several decades.
The second speaker was Carole Etienne, research engineer at the CNRS Laboratory Interactions, Corpus, Apprentissage, Représentations (ICAR) in Lyon, founding member and current leader of the project Corpus de langue parlée en interaction (CLAPI). CLAPI hosts 70 corpora, which document the use of French in naturally occurring interactions in private and institutional settings (e.g., ordinary conversation, phone calls, service encounters, education settings). Besides her engagement in CLAPI, Carole Etienne takes part in various initiatives that promote the accessibility of spoken corpora and establish best practices for constituting, storing, and sharing such corpora.
Drawing on the expertise of these two speakers, workshop 3 was dedicated to the topic of the longevity of large data banks as research infrastructures for language data. Large data banks are to be viewed as “continuous accomplishments”, as Johannes Wagner put it, that is, as the product of constant effort put into their conceptualisation, management, and upkeep over an extended period. The present report covers the first day of workshop 3, which focused on the challenges faced by large data banks such as TalkBank and CLAPI over the years and the solutions and developments that had to be implemented to maintain these platforms. These challenges relate, in particular, to technological changes, evolving user needs, and institutional and political standards that emerged with regard to data management and data sharing (such as the FAIR principles – four concepts that emphasise machine-actionability, “i.e., the capacity of computational systems to find, access, interoperate, and reuse data with none or minimal human intervention”, GO FAIR initiative, 2022). In our report on the second day of workshop 3, we briefly address the possible applications of TalkBank and CLAPI, but otherwise concentrate on the discussions and questions by workshop participants that accompanied the presentations of that day.
Adapting to constantly evolving technologies
The continuous evolution of technology is one of the most significant issues faced by developers and managers of online databases. The development of new technologies forces databases to adapt in order to remain not only easily usable, but also accessible and interoperable. TalkBank has had to integrate these technological advancements since its initial development at the end of the previous century, as was highlighted by Wagner’s talk entitled “Conversation Analysis Corpora at TalkBank: Reflections of 20+ years of building, sharing and updating conversation analytic corpora on the web. Is it worth it?”.
One major technological challenge faced by the managers of TalkBank, and more precisely of its conversation analytic component CA Bank, was the necessity to translate the widely accepted Jeffersonian transcription conventions (Jefferson, 2004) into a computer-readable format. Since these conventions were first developed on a typewriter, before the widespread use of computers, they involved some symbols that needed to be replaced or adapted to be processed by machines. One example given by Wagner is that of Jeffersonian symbols that encode multiple phenomena, such as the equal sign “=”, which indicates both continuation between turns and continuation within turns, or the parentheses “()”, which signal not only some elapsed time, but also unsure hearing or a transcriber’s description, depending on the content within the parentheses. The polysemy of these symbols needed to be eliminated so that transcripts could be adequately processed by computers, which are not able to disambiguate varying uses of the same symbol based on content or contextual information. Besides these difficulties, adapting Jeffersonian conventions also provided an opportunity, Wagner explained, to introduce new symbols to the convention, so as to encompass a broader range of vocal phenomena, especially for languages other than English.
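To make the disambiguation problem concrete, the following Python sketch shows one way in which the different uses of parentheses could be told apart by explicit rules. It is an illustration only, not TalkBank's actual encoding: the category names and the function classify_parentheses are hypothetical, and the patterns assume common Jeffersonian usage, namely a number for a timed pause, a single period for a micropause, double parentheses for a transcriber's description, and any other parenthesised content for unsure hearing.

```python
import re

# Hypothetical category names. The patterns assume common Jeffersonian usage
# and are NOT TalkBank's actual CHAT encoding. Order matters: the most
# specific pattern is tried first, the catch-all last.
PAREN_PATTERNS = [
    ("transcriber_description", re.compile(r"\(\((?P<body>[^()]+)\)\)")),
    ("timed_pause", re.compile(r"\((?P<body>\d*\.?\d+)\)")),
    ("micropause", re.compile(r"\((?P<body>\.)\)")),
    ("unsure_hearing", re.compile(r"\((?P<body>[^()]*)\)")),
]


def classify_parentheses(line):
    """Return (category, content) pairs for each parenthesised item in a line."""
    found = []
    pos = 0
    while pos < len(line):
        if line[pos] != "(":
            pos += 1
            continue
        for category, pattern in PAREN_PATTERNS:
            match = pattern.match(line, pos)
            if match:
                found.append((category, match.group("body").strip()))
                pos = match.end()
                break
        else:
            pos += 1  # unbalanced "(": skip it and carry on
    return found


# Invented example line, not taken from any corpus.
example = "so (0.5) I went to the (shop) ((coughs)) and (.) yeah"
for category, content in classify_parentheses(example):
    print(category, repr(content))
```

A rule set of this kind has to be written down for every ambiguous symbol, which illustrates why replacing polysemous symbols with dedicated ones was the more robust choice for a machine-readable format.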
In addition to issues closely tied to the specific aims and needs of conversation analysis, the development of computer technology and software also impacted the structure of data banks. To ensure their functioning and longevity, data bank systems must be continuously updated. Wagner highlighted several important technological updates that TalkBank underwent: the transition from ASCII to Unicode character encoding standards in the early 1990s, which necessitated a complete re-encoding of the data; the integration of the markup language XML; and, finally, the change, within the Macintosh operating system, from a 16-bit to a 32-bit architecture, which required the implementation of a new version of the CLAN programme to maintain its usability. Wagner stressed the crucial need for large data banks to remain adaptable to technological change and the significant effort this requires, as these developments cannot be anticipated, but must be dealt with as they occur.
Observing data management standards: the FAIR principles
Carole Etienne’s talk “Sharing our multimodal corpora in CLAPI: what the FAIR principles imply, from the choice of metadata to the reuse of our corpora in other research projects” took into account technological progress and changing requirements since the creation of CLAPI, primarily addressing the challenges posed by institutional norms and political choices with regard to data storage and data sharing, such as the FAIR principles. She highlighted how the platform’s data modelling and search tools were designed to be aligned with these principles, which stipulate the findability, accessibility, interoperability, and reusability of data.
CLAPI provides general metadata for each corpus, such as the corpus name, the investigator’s name, the year of collection, the type of data (e.g., audio or video), the number of recordings and their total duration. This organisation enhances the data’s findability within the platform and - via programming interfaces - across platforms. Additionally, CLAPI offers audio or video snippets of the data (either 20 or 40 seconds) that users can stream or download to gain a better understanding of the corpora and find out which resources are relevant for them.
Another issue CLAPI addresses is the heterogeneity of the data they host, which originate from various sources and do not conform to a unique, coherent corpus design. When integrating data provided by researchers, the platform strives to avoid an overrepresentation of certain settings, genres, or speaker groups over others. Etienne gave the example of a collection of game interactions of approximately 20h. It was meant to be included in CLAPI, but the size of the corpus would have led to an overrepresentation of this interaction type compared to others. To avoid this imbalance, CLAPI developed a sub-platform specifically for this corpus. While this procedure avoids troubles of data representativity, it makes data less findable, since not all data are stored and accessed in the same location. In the case just mentioned, a systematic link was created between CLAPI and its sub-platform to address this issue.
As for the reusability of the data, both video files and separate audio tracks of the primary data are provided whenever possible, so that users not at ease with video data can still reuse the data, choosing to work with audio files alone. Transcripts are offered both in their original version and in a simplified format, which does not take into account prosodic and interactional features of spoken language such as pauses, intonation, and overlap.
As to search tools, CLAPI has opted for an infrastructure tailored to researchers in CA and IL, while simultaneously providing metadata and annotations that are useful for researchers outside these communities. CLAPI offers the possibility to document extensive sociolinguistic metadata about interactions (e.g., interaction types, number of speakers, level of formality) and speakers (e.g., age, gender, first language, occupation), allowing users to filter and search the database according to these criteria. The database also integrates common corpus linguistic tools to extract frequency lists (based on both types and tokens), find co-occurrences and list concordances. The concordancer is particularly suitable for CA, as it provides a large amount of context preceding and following the target item and offers the option to stream or download the audio/video excerpt in which the item is located. In addition to these tools, Etienne mentioned further tools specifically developed to analyse interactional and oral features of the data. For example, the platform supports queries based on the number of speakers, the position of the target item within a turn (a shortcut we use here to refer to the annotated unit that in CLAPI approximates the turn as an emic phenomenon) or the size of a turn in words. It also offers filter options for search results with respect to specific details of the sequential context, such as the production inside or outside an overlap or before or after a pause.
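The kinds of queries described above can be sketched in a few lines of Python. The sketch below is a toy model under stated assumptions, not CLAPI's implementation: the Turn class, the function names, and the sample turns are all invented for illustration. It shows a frequency list, a concordance that returns neighbouring turns as context, and filters on the position of the target item within the turn, on turn length in words, and on overlap.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Turn:
    """A hypothetical annotated unit approximating a turn."""
    speaker: str
    words: list
    in_overlap: bool = False


def frequency_list(turns):
    """Count tokens across all turns (case-insensitive)."""
    return Counter(word.lower() for turn in turns for word in turn.words)


def position_label(pos, length):
    """Describe where a token sits within its turn."""
    if length == 1:
        return "alone"
    if pos == 0:
        return "initial"
    if pos == length - 1:
        return "final"
    return "medial"


def concordance(turns, target, context=1):
    """List every hit of `target` with its surrounding turns and turn-level details."""
    hits = []
    for i, turn in enumerate(turns):
        for pos, word in enumerate(turn.words):
            if word.lower() == target:
                hits.append({
                    "turn_index": i,
                    "position": position_label(pos, len(turn.words)),
                    "turn_length": len(turn.words),
                    "in_overlap": turn.in_overlap,
                    "before": turns[max(0, i - context):i],
                    "after": turns[i + 1:i + 1 + context],
                })
    return hits


# Invented sample data, not taken from any CLAPI corpus.
turns = [
    Turn("A", ["tu", "viens", "demain"]),
    Turn("B", ["voilà", "c'est", "ça"], in_overlap=True),
    Turn("A", ["d'accord"]),
    Turn("B", ["voilà"]),
]

hits = concordance(turns, "voilà")
# Keep only turn-initial hits produced in overlap.
initial_in_overlap = [h for h in hits if h["position"] == "initial" and h["in_overlap"]]
```

Returning whole neighbouring turns rather than a fixed window of characters reflects the point made above that a concordancer suited to CA needs generous sequential context around each hit.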
At the end of her presentation, Etienne highlighted a significant ongoing endeavour to comply with the FAIR principles: to enhance findability and interoperability, CLAPI shares data with other platforms. For example, a portion of CLAPI corpora is accessible via Ortolang. This infrastructure aggregates 316 corpora (193 of which are oral) and offers various tools for data analysis and visualisation. It defines a set of metadata that intersects with those proposed by CLAPI, but also includes other descriptors such as the number of words in transcripts, character encoding (e.g., ISO-8859, UTF-8) and annotation level (conversational, orthographic, morphosyntactic, etc.). Ortolang also offers corpus management tools that are compatible with the Text Encoding Initiative (TEI). TeiConvert and TeiCorpo convert various annotation formats (e.g., ELAN, CLAN, Praat) into TEI formats and vice versa. The online metadata editor TeiMeta, on the other hand, allows users to modify the nodes of an XML metadata scheme starting from an existing corpus description in ODD style (“one document does it all”) or to create metadata files from scratch. Another initiative CLAPI collaborates with is the Corpus d’Étude pour le Français Contemporain (CEFC), which aims to collect a large range of datasets to document contemporary French in various uses and modalities (written and spoken). Unlike CLAPI, it offers detailed metadata regarding the sound quality and employs automatic syntactic tools for annotating data in terms of grammatical structures and dependencies. Finally, Carole Etienne mentioned the CORpus, Langues et Interactions (CORLI) consortium. Among other projects, CORLI has developed an ODD template adapted to oral corpora, which has been integrated into the TeiMeta editor as one of several metadata file models on offer.
While sharing across platforms increases the findability and accessibility of data from different perspectives, thereby conforming to the first two FAIR principles, it also generates a significant workload for data managers. Indeed, whenever new data is added to CLAPI, it must be added to Ortolang, too, and must be formatted to meet the specific formal and metadata schemes of each platform. Furthermore, Etienne pointed out possible conflicts between findability and accessibility on the one hand and interoperability and reusability on the other. In fact, the results of the reuse of data within other projects and initiatives are not always fully satisfactory. For example, she showed some inadequacies of the automatic syntactic segmentation tool of CEFC when handling typical conversational phenomena such as repair and non-linear syntactic productions: a lack of interoperability at the level of data modelling that compromises the reusability of transcripts in the CEFC environment.
Implementing user needs
A crucial property of data bank structures is their capacity to adapt to the needs and practices of the concerned scientific communities. Large data banks are primarily conceived as tools to support researchers in their investigations. Scientific practices, however, evolve over time, changing the researchers’ needs and expectations with regard to large data banks; cases in point are the emergence, within CA and IL, of video-based research or of mixed methods paradigms that integrate quantitative findings. How should the managers of data banks for spoken corpora address and accommodate their users’ needs and requests?
Besides other requirements regarding CA Bank such as the use of the Jefferson conventions (see supra) and the systematic linking of transcripts with media files, Wagner highlighted one specific request which constituted a challenge for the management of the data bank: the possibility to collaboratively revise and complement existing transcripts within the platform. This request is rooted in analytical practices that are epistemologically central in CA’s ethnomethodological approach. Within that approach, transcription is viewed as “one way to pay attention to recordings of actually occurring events” (Jefferson, 1985, p. 25). In other words, each transcript is produced with a specific analytic aim, and it should therefore be possible to modify transcripts as new analytic aims are pursued. CA scientific practice therefore ideally demands a high degree of flexibility when it comes to using a data bank and resists full standardisation. This analytical preference raises technical issues, since it implies maintaining multiple coexisting versions of the same transcript file, which in turn complicates the querying of data. While collaborative commentary is a feature in TalkBank that is used in connection with other subcorpora, e.g. AphasiaBank, the decision taken for the CA banks was to avoid this additional complexity by keeping only one “master” file.
Within CLAPI, too, transcripts were adapted over time to satisfy user requests, but in terms of format rather than content. Initially, transcripts were provided only in the time-aligned EAF format, an XML schema that is specific to the multimedia annotator ELAN (The Language Archive, 2017). However, since some users are not familiar with this application, CLAPI then started providing the transcripts in formats that are compatible with common text editors (e.g., DOC, RTF, TXT) and are therefore easy to reuse. Another request that was brought to the CLAPI managers’ attention was to provide an abstract of each recording, which would indicate the content and the unfolding of the interactions. As Etienne explained, this request was not granted because it was judged incompatible with the CA perspective, which aims at highlighting the emergent and contingent development of interaction as it is experienced by the participants themselves. Instead, CLAPI offers a short description of the setting in which the interaction takes place.
These examples show that while user needs do certainly constitute a priority for data bank managers, they may also raise technical and epistemological issues. The users’ wishes must therefore be carefully examined as to their technical feasibility and their adequacy with regard to the data bank’s scientific objectives.
The complexity of ‘heritage data’
Both TalkBank and CLAPI are open and cumulative data banks, meaning that new corpora can be added at any moment. These databases also function as repositories for “heritage data”, that is, data that were collected before data banks existed and before data sharing became common practice. While these data increase the representativity and comprehensiveness of the data banks, they also raise specific technical and ethical problems.
Regarding their technical management, Wagner pointed out that the structure of heritage corpora usually does not correspond to the technical standards of the databases, since the data owners collected them without having in mind their future integration into a platform. They thus require much manual processing to be integrated. An example of such data hosted by TalkBank is the Jeffersonian corpus. When it was first included in the platform, it had to be converted manually to be compatible with CLAN’s query tools, notably by inserting lines of code to indicate metadata (e.g., language, participants). While this minimal manual annotation ensures the searchability of the data, it is not complete and therefore not fully responsive to technological updates of the system or of the query software. Unlike corpora that were collected and designed to be integrated into TalkBank from the start, heritage corpora cannot be processed automatically. To illustrate this issue, Wagner reported how the automated implementation of a new collaborative transcription system at some point affected transcripts that were not encoded according to the TalkBank XML standards. Those appeared significantly impoverished in the frontend display, with a loss of information regarding the annotation of specific oral features and the distribution of turns between speakers.
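The step of adding metadata lines to a heritage transcript can be illustrated with a short Python sketch. It is loosely modelled on CHAT-style header lines (such as @Begin, @Languages, @Participants, and @End) and is deliberately simplified: the function add_chat_style_headers, the speaker codes, and the sample lines are hypothetical, and the output has not been validated against CLAN, which would require further headers. The sketch also assumes that legacy lines have the form "label: speech".

```python
def add_chat_style_headers(legacy_lines, language, participants):
    """Wrap legacy transcript lines in minimal CHAT-style metadata headers.

    `participants` maps a legacy speaker label to a (code, role) pair.
    Both the mapping and the simplified header set are assumptions made
    for illustration, not the procedure actually used at TalkBank.
    """
    roster = ", ".join(f"{code} {role}" for code, role in participants.values())
    out = [
        "@UTF8",
        "@Begin",
        f"@Languages:\t{language}",
        f"@Participants:\t{roster}",
    ]
    for line in legacy_lines:
        label, sep, speech = line.partition(":")
        if sep and label.strip() in participants:
            code = participants[label.strip()][0]
            out.append(f"*{code}:\t{speech.strip()}")
        else:
            # Keep material that cannot be attributed to a speaker
            # as a comment line rather than dropping it silently.
            out.append(f"@Comment:\t{line.strip()}")
    out.append("@End")
    return out


# Invented sample lines, not taken from any corpus.
legacy = ["A: hello", "B: hi there", "(door slams)"]
speakers = {"A": ("SPA", "Speaker"), "B": ("SPB", "Speaker")}
converted = add_chat_style_headers(legacy, "eng", speakers)
```

Even in this toy version, anything that does not fit the expected line shape ends up in a generic comment line, which hints at why minimally converted heritage transcripts carry less structured information than natively encoded ones.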
An ethical issue posed by heritage corpora concerns data access, as Etienne explained. Just like TalkBank, CLAPI hosts several corpora that were compiled years before the development of data-sharing platforms, one of them being the “Bielefeld” corpus gathered by Elisabeth Gülich. The consent forms used at that time did not consider the potential future dissemination of the data within the scientific community and the data therefore could not be fully shared through the platform. Consequently, CLAPI agreed with the data owner to provide only short samples that can be streamed (max. 40 seconds), but not downloaded. Researchers can nevertheless be granted full corpus access upon signing an agreement with the data owner.
***
This first day of the workshop offered various opportunities to acknowledge the importance of large databases for researchers working with audio or video recordings and transcripts of spoken interaction. Keeping such resources up to date over an extended period is an arduous endeavour, which requires both time and financial resources. Large data banks such as TalkBank or CLAPI strive to adopt metadata, annotation and transcription standards that ensure the longevity of the data they host and regularly update the infrastructure to adapt it to technological advancements, institutional constraints, and user preferences. The invited speakers also highlighted the interdependence between the data hosted and the hosting platform, pointing out that the possibility, for large databases, to continue functioning over time is contingent upon the structure of the corpora they host and upon choices that were made by corpus builders. This interdependence should be taken into account by researchers who do fieldwork and collect their own spoken data. Carole Etienne and Johannes Wagner emphasised the importance of raising awareness, especially among young researchers, about existing databases and their requirements, to enable them to maximise the interoperability and adaptability of the data they produce.
Guillaume Stern & Johanna Miecznikowski
Presentations
Etienne, C. (2024, January 15-16). Sharing our multimodal corpora in CLAPI: What the FAIR principles imply, from the choice of metadata to the reuse of our corpora in other research projects [Workshop presentation]. Database evolution for the study of social interaction: Designing annotations for long-term usability, Neuchâtel, Switzerland.
Wagner, J. (2024, January 15-16). CA corpora at TALKBANK – Reflections over 20+ years of building, sharing, and updating CA corpora on the web. Is it worth it? [Workshop presentation]. Database evolution for the study of social interaction: Designing annotations for long-term usability, Neuchâtel, Switzerland.
References
Papers
Jefferson, G. (1985). An exercise in the transcription and analysis of laughter. In T. A. van Dijk (Ed.), Handbook of Discourse Analysis (Vol. 3, pp. 25-34). Academic Press.
Jefferson, G. (2004). Glossary of transcript symbols with an introduction. In G. H. Lerner (Ed.), Conversation Analysis: Studies from the First Generation (pp. 13-31). John Benjamins. https://doi.org/10.1075/pbns.125.02jef
MacWhinney, B. (2019). Understanding spoken language through TalkBank. Behavior Research Methods, 51, 1919–1927. https://doi.org/10.3758/s13428-018-1174-9
The Language Archive (2017). ELAN Annotation Format EAF. Schema version 3.0. Max Planck Institute for Psycholinguistics. https://www.mpi.nl/tools/elan/EAF_Annotation_Format_3.0_and_ELAN.pdf
Websites
GO FAIR initiative. (2022, January 21). FAIR Principles - GO FAIR. https://www.go-fair.org/fair-principles/
Data banks
CABank: https://ca.talkbank.org/
Corpus de langue parlée en interaction (CLAPI): http://clapi.ish-lyon.cnrs.fr/
Corpus d’Étude pour le Français Contemporain (CEFC): https://repository.ortolang.fr/api/content/cefc-orfeo/11/documentation/site-orfeo/index.html
Ortolang: https://www.ortolang.fr/fr/accueil/
TalkBank: https://www.talkbank.org/
Further readings
Groupe ICOR (M. Bert, S. Bruxelles, C. Etienne, L. Mondada, V. Traverso). (2008). Tool-assisted analysis of interactional corpora: "voilà" in the CLAPI database. Journal of French Language Studies, 18(1), 121-145.
Groupe ICOR (M. Bert, S. Bruxelles, C. Etienne, E. Jouin-Chardon, J. Lascar, L. Mondada, S. Teston, V. Traverso). (2010). Grands corpus et linguistique outillée pour l'étude du français en interaction (plateforme CLAPI et corpus CIEL). Pratiques, 147-148 ("Interactions et corpus oraux"), 17-35.
Groupe ICOR (H. Baldauf-Quilliatre, I. Colon de Carvajal, C. Etienne, E. Jouin-Chardon, S. Teston-Bonnard, V. Traverso). (2016). CLAPI, une base de données multimodale pour la parole en interaction : apports et dilemmes. In M. Avanzi, M.-J. Béguelin, & F. Diémoz (Eds.), Corpus de français parlés et français parlés des corpus. Cahiers Corpus.
MacWhinney, B., & Wagner, J. (2010). Transcribing, searching and data sharing: The CLAN software and the TalkBank data repository. Gesprächsforschung, 11, 154-173. http://www.gespraechsforschung-ozs.de