Managing and using data banks of spoken language: perspectives from corpus providers and users
CHORD talk in interaction
15/09/2024
In January 2024, Johannes Wagner (University of Southern Denmark) and Carole Etienne (CNRS Laboratory Interactions, Corpus, Apprentissage, Représentations (ICAR), Lyon) appeared as guest speakers at the two-day workshop “Database evolution for the study of social interaction: Designing annotations for long-term usability” (workshop 3 organised by CHORD-Talk-in-interaction). They were invited to report on their experience with designing, managing, and updating digital corpus environments, more specifically TalkBank and CLAPI (Corpus de LAngue Parlée en Interaction). The CHORD-Talk-in-interaction documentation section contains a comprehensive report on the first day of workshop 3, covering the history of the databases under discussion and their adaptation to evolving technologies. That report also documents the platform providers’ commitment to adhering to the FAIR principles of scientific data management and stewardship, as well as their efforts to align the infrastructure with the evolving needs of their users. The article at hand concentrates on the second day of workshop 3. It first provides a summary of the presentations and then reports on the discussions sparked by the talks, which reflect the interests and concerns of the scientific communities involved.
Possible applications and use cases of TalkBank and CLAPI
The second day of workshop 3 began with a presentation by Johannes Wagner on “The usability of CA corpora on the web: How is it done, for what purposes are the data used, and who is using the data from where?”. Wagner presented Samtalebank, a component of TalkBank’s Conversation Analysis data bank (CABank), which features contemporary spoken Danish interactions. He described the structure and users of the data bank before presenting two use cases. The first use case examined business calls from a Danish blue-collar worker between 1986 and 1991, seeking longitudinal evidence for a growing connection between the callers (‘making friends’). The second use case presented the work of Jakob Steensig’s team, who used Samtale together with the non-public corpus AULing to analyse grammar in Danish interactions (‘samtalegrammatik’).
Wagner furthermore elaborated on different types of data and whether they can be stored and visualised in TalkBank. He explained that written text, e.g., a transcribed phone call, is easy to store and visualise since there are established ways of encoding it. Transcribed video data can also be included in TalkBank, but things become complicated when visualisations of the interaction, e.g., in a graphic novel style (cf. Heinemann & Fox 2019), are to be saved. Particularly impractical to store in TalkBank are unfocused, non-verbal interactions, such as those in Mortensen and Wagner’s (2021) study on cyclists’ bodily reaction patterns in urban traffic. As there is no established encoding scheme for such data, they tend not to be annotated and, for this reason, cannot be queried in a digital database. The benefits of storing them in an online corpus infrastructure are thus questionable. However, considering the growing interest in embodied and non-verbal interaction, Wagner added, it could be prudent to think about new ways of encoding and annotating such data to enhance their reusability in data banks.
A final topic of Wagner’s presentation pertained to the relation between TalkBank’s functionality and usability. There tends to be a trade-off between these two features, since adding more functions to a service often makes that service more complex and thus reduces usability. Wagner stated that the developers of CLAN (Computerized Language ANalysis), which is TalkBank’s software for creating and analysing transcripts, had prioritised functionality over usability. This resulted in high power and efficiency, but also made the programme neither very intuitive nor fault-tolerant and therefore challenging to learn and apply.
The second talk of the day was given by Carole Etienne, who discussed the possible applications of CLAPI. The first part of her contribution addressed the topic “Learning to interact in French as a foreign language based on oral corpora and research findings” and presented CLAPI-FLE (Français Langue Etrangère), a sub-section of CLAPI specifically dedicated to language teaching. Etienne traced the development of a network of FLE researchers and practitioners around Véronique Traverso, which compiled and edited a collection of 40 transcribed extracts of social interactions in French, adding descriptions and instructions designed for a FLE context. The extracts are enriched with additional information, for instance on the topic of the conversation, the language level, the type of discourse (e.g., formal), the variety of French (e.g., L1/native speakers), the domain (e.g., professional), the situation (e.g., purchase in a shop), the actions performed (e.g., to buy something, to thank, to take one’s leave etc.) or the vocabulary used. CLAPI-FLE furthermore provides fact sheets that elaborate on certain pragmatic processes and functions, as well as an 'in practice' section that explains the communicative, socio-cultural, interactional and/or linguistic objectives of selected extracts, including suggestions on how to use them in the classroom. According to Etienne, because of its complexity, CLAPI-FLE is better suited to teachers than to students. However, there is another resource intended for use by learners: CORAIL (Corpus Oraux, Apprenant, Interaction, Linguistique) offers a set of French interactions with a simple vocabulary, deemed accessible to learners. It provides explanations on how to perform certain actions in French (Comment dire? Comment faire?), e.g., to ask a question (poser une question) or to express a feeling (exprimer une émotion).
Additionally, it lists a range of everyday expressions (les expressions du quotidien), such as ça marche or tu vois, and enriches them with a brief explanation and exemplary recordings of their use in naturally occurring French interactions. The last oral corpus for teaching purposes presented by Etienne was INTERFARE (INTERagir plus Facilement en REunion), a resource that includes recordings of work meetings to help course designers and teachers in vocational or academic training understand the workings of this activity type, including multimedia activities and frequent expressions in meetings.
Etienne’s second contribution was entitled “Articulating quantitative and qualitative analyses in CLAPI” and presented some of the platform’s tools and how they can be used in studies with different research designs. Etienne identified three stages of research within which CLAPI can be fruitfully used: i) to validate a research hypothesis, ii) to build a collection in accordance with the object of study, iii) to formulate a hypothesis and potentially identify novel objects of study. For each of these approaches, she showed suitable CLAPI tools. For instance, CLAPI’s concordancer, the calculation of co-occurrences and multi-criteria queries can be used to validate a research hypothesis, whereas word lists and further statistics (numbers and percentages for a search term) can be used to generate a hypothesis. When building a collection, metadata can be used to find occurrences of the same phenomenon in different interactional genres. Etienne closed her presentation with the recommendation to work in a continuous, iterative circle comprising both quantitative and qualitative methods.
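To make the tools named above more tangible, the following is a minimal sketch of what a concordancer, a co-occurrence count and a simple frequency statistic compute. It does not reproduce CLAPI's implementation or query interface: the function names, the window size and the three toy utterances are hypothetical and were invented for illustration only.

```python
import re
from collections import Counter

# Hypothetical toy utterances (speaker, text); invented, not CLAPI data.
UTTERANCES = [
    ("A", "ça marche on se voit demain"),
    ("B", "oui ça marche tu vois"),
    ("A", "tu vois c'est simple"),
]


def tokenize(text):
    """Lowercase the text and split it into word tokens (apostrophes kept)."""
    return re.findall(r"[\w']+", text.lower())


def concordance(utterances, term, width=2):
    """Keyword-in-context rows: (speaker, left context, keyword, right context)."""
    rows = []
    for speaker, text in utterances:
        tokens = tokenize(text)
        for i, tok in enumerate(tokens):
            if tok == term:
                left = " ".join(tokens[max(0, i - width):i])
                right = " ".join(tokens[i + 1:i + 1 + width])
                rows.append((speaker, left, tok, right))
    return rows


def cooccurrences(utterances, term, width=2):
    """Count the tokens found within `width` tokens of each hit of `term`."""
    counts = Counter()
    for _, text in utterances:
        tokens = tokenize(text)
        for i, tok in enumerate(tokens):
            if tok == term:
                counts.update(tokens[max(0, i - width):i])
                counts.update(tokens[i + 1:i + 1 + width])
    return counts


def frequency(utterances, term):
    """Absolute count of `term` and its share of all tokens in percent."""
    tokens = [t for _, text in utterances for t in tokenize(text)]
    hits = tokens.count(term)
    share = 100.0 * hits / len(tokens) if tokens else 0.0
    return hits, share


if __name__ == "__main__":
    for row in concordance(UTTERANCES, "vois"):
        print(row)
    print(cooccurrences(UTTERANCES, "vois").most_common(3))
    print(frequency(UTTERANCES, "vois"))
```

In the iterative cycle Etienne recommended, word lists and frequencies of this kind would typically prompt a hypothesis, whereas concordance lines lead back to the individual sequences for qualitative inspection.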
Each presentation was followed by a discussion with the online and on-site participants of the workshop. Among the various topics addressed, three emerged as particularly central, which we will turn to in the following sections of this report.
Spoken corpora as multi-purpose resources addressing various user groups
At several points in the discussions and presentations, the workshop participants and guest speakers referred to the variety of research interests that motivate the use of corpus data in the fields of linguistics concerned with spoken language. This variety of perspectives is challenging for both corpus developers and users. Wagner argued that corpus builders and managers need to have an understanding of their users to be able to design and maintain a resource tailored to their needs and (technical) abilities. The users, their disciplinary backgrounds and research interests influence not only the type of data being provided, but also the ways of processing, encoding, visualising, and accessing them. If the main user group is composed of, say, phoneticians, a corpus platform may strive to provide audio recordings of high quality. If users have a sociolinguistic background, it may be wise to invest more time in collecting and providing rich event- and speaker-related metadata. If users are corpus linguists interested in morpho-syntax, a platform may want to provide elaborate query tools.
Etienne, too, shared her experience with different user groups. Besides various successful collaborations, she also mentioned communication issues between corpus linguists and other linguistic disciplines, due to differing standards and objectives. A consequence of such discrepancies is that in some cases search tools and data do not match perfectly. She also recalled difficulties in complying with individual user requests, particularly when users are looking for a large number of hits. Workshop participants then asked how corpus builders decide which individual requests to meet and which to reject. Etienne explained that, as far as CLAPI is concerned, the focus is on spoken language in interaction and the platform therefore primarily complies with requests made by scholars in interactional linguistics and conversation analysis. Requests that conflict with the approaches of these disciplines are usually rejected. Sometimes, moreover, the implementation of a requested innovation appears impractical for technical reasons (see also the report on the first day of workshop 3 under “Implementing user needs”). Wagner agreed that, to maintain coherence and usability, the spectrum of services provided by a platform cannot be indefinitely wide. He nevertheless encouraged the workshop participants to approach corpus providers with requests, since these reveal the needs of the communities involved, even if not all problems can be resolved immediately. In video data of non-verbal interaction, for example (a case he had mentioned in his presentation), it could be helpful for corpus providers to learn what users are looking for in the data, so as to develop coding schemes accordingly, to the benefit of future users.
The topic of data encoding and visualisation inspired some workshop participants to enquire about front-end programming. In the case of audio/video recorded and transcribed spoken interactions, the front-end seems to be particularly challenging because it needs not only to offer an interface to query text, as commonly implemented in workbenches for written corpora, but also to align multimedia files with transcripts. The environment becomes even more complex when several transcript versions co-exist, for example one following the Jeffersonian transcription convention (Jefferson 2004) and another that visualises the interaction in a graphic novel style. Unlike the front-end of a repository, which has the primary purpose of storing data, or of a modest database managed by an individual researcher, the front-end of a more advanced platform is a programming task that ideally requires linguistically trained researchers to collaborate with professional programmers and web designers. Wagner explained that, in this regard, TalkBank does not entirely conform to the expectations of contemporary users. The front-end and the back-end are essentially analogous, reflecting the habits of the era in which the platform was created. To develop an up-to-date graphical user interface, the managers of the platform would need to invest considerable resources. For the time being, the available resources are instead being used to guarantee the core functions of TalkBank, i.e. to maintain and extend the database.
Interoperability and standards
Another major topic of discussion was related to data export and processing in external programmes, a type of workflow that raises questions of interoperability.
As discussed previously, online databases such as TalkBank and CLAPI offer an integrated environment in which both corpus data and search tools are available, thereby enabling users to interact with the data online. This kind of service is crucial when platforms are obliged to restrict download options for reasons of data protection and is, independently of that, appreciated by many platform users because it allows them to execute certain common analytical tasks easily and rapidly. Some researchers, however, prefer to work offline and/or to employ different applications to process corpus data, be it because they are familiar with them or because these applications allow them to perform additional tasks. Provided that download is not precluded for security reasons, how can a platform best ensure that data are reusable in various programmes? In what formats should data be stored and exported so that they can be used by different researchers and their tools, e.g. Praat, Transana, INCEpTION, etc.?
Wagner underlined the importance of export functions with built-in conversion tools for widespread formats, even if moving data across systems carries the risk of losing information. He encouraged the workshop participants to approach database managers, if necessary, and communicate which additional export formats would be helpful to them. Even when data are protected by access restrictions, it is possible to contact data owners and still reach an agreement to obtain the data. If platform managers know which formats are needed most, suitable converters can be developed or implemented. Etienne gave an example: she reported that some CLAPI users had asked for the possibility to export transcripts in TXT or DOCX formats to be able to edit them as text documents for publications. The request was fulfilled, and as a result, each transcript can now be accessed in either the original TRS format, generated by the transcription editor used by CLAPI, or the RTF format, which can be used as an interchange format between text editors from different manufacturers on different operating systems.
Considering the numerous applications and formats used by linguists working with spoken corpora, one participant turned to the guest speakers with an enquiry regarding standardisation. Researchers in the field use different transcription software, transcription conventions, and corpus platforms. It is not uncommon for them to become proficient in one tool or system and then adhere to it for an extended period. In the future, would it be advantageous to strive for more uniformity? Alternatively, would it be more beneficial to permit diversity and then educate students accordingly, for instance through PhD schools, in order to prepare the next generation to navigate these different online corpora or environments? Wagner responded that, in his estimation, despite the evident advantages of standardisation, it was unlikely that the existing systems would converge towards one particular solution in the near future; it appeared implausible to make one system fit every purpose. Rather than attempting to unify diverse systems and programmes, he deemed it more promising to maintain diversity and work towards making the various systems interoperable. Etienne agreed, adding that the systems currently used have different strengths and benefits: some are especially useful for multimodal annotations, others are good at finding keywords, etc. In her experience, when merging tools or services, the risk of losing information is even higher than when converting documents from one format to another. She reminded the participants of TEIcorpo, a tool developed by Christophe Parisse that she had mentioned during the first workshop day. The tool converts ELAN, CLAN, Transcriber, and Praat files to XML files that conform to the TEI scheme for spoken language.
TEI stands for Text Encoding Initiative, “a consortium which collectively develops and maintains a standard for the representation of texts in digital form” and whose “chief deliverable is a set of Guidelines which specify encoding methods for machine-readable texts, chiefly in the humanities, social sciences and linguistics” (TEI: Text Encoding Initiative). The TEI XML schemes are widely used in the fields just mentioned and interpreted by various applications, which is why they were chosen as a target format by the developers of the TEIcorpo converter.
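As a rough illustration of what a TEI-style target format for transcribed speech can look like, the sketch below serialises two time-aligned utterances as an XML fragment using the TEI elements `timeline`, `when` and `u`. It is not TEIcorpo's code and does not reproduce its output: the function name, the speaker identifiers and the sample utterances are hypothetical, the placement of the elements is an assumption, and the fragment omits the header and has not been validated against a TEI schema.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"  # serialised as xml:id


def q(tag):
    """Qualify a tag name with the TEI namespace."""
    return f"{{{TEI_NS}}}{tag}"


def to_tei_fragment(utterances):
    """Serialise (speaker, start_sec, end_sec, text) tuples as a TEI-style fragment.

    Illustrative assumption only: a complete TEI document would also need a
    header, and the result is not checked against any schema.
    """
    ET.register_namespace("", TEI_NS)  # emit TEI as the default namespace
    text = ET.Element(q("text"))

    # Collect all time points (plus an origin at 0 s) and give each an ID.
    times = sorted({0.0} | {float(t) for _, s, e, _ in utterances for t in (s, e)})
    ids = {t: f"T{i}" for i, t in enumerate(times)}

    timeline = ET.SubElement(text, q("timeline"), {"unit": "s"})
    for t in times:
        attrs = {XML_ID: ids[t]}
        if t > 0:
            # Offsets are expressed relative to the origin point T0.
            attrs["interval"] = str(t)
            attrs["since"] = "#T0"
        ET.SubElement(timeline, q("when"), attrs)

    body = ET.SubElement(text, q("body"))
    for speaker, start, end, words in utterances:
        u = ET.SubElement(body, q("u"), {
            "who": f"#{speaker}",
            "start": f"#{ids[float(start)]}",
            "end": f"#{ids[float(end)]}",
        })
        u.text = words
    return ET.tostring(text, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical sample data, invented for illustration.
    sample = [("spkA", 0.0, 1.2, "ça marche"), ("spkB", 1.2, 2.5, "tu vois")]
    print(to_tei_fragment(sample))
```

Keeping the time points in a shared timeline and letting each utterance point to them is what allows a transcript to stay aligned with its recording after export to another tool.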
Citation practices
The last discussion topic we would like to recall here pertained to the communication between corpus builders, platforms and researchers who reuse data. Both guest speakers assessed the culture of platform use from their point of view, addressing certain problems of data handling. In particular, in their experience, researchers surprisingly often include data in publications without naming the platforms as a source. This observation inspired some workshop participants to ask if users need more information about what types of use are permitted and how platforms and corpus owners must be cited. Is it necessary to draft codes of conduct and train young researchers accordingly? Wagner referred the workshop participants to TalkBank's ‘Ground Rules’ listed on the platform’s main page, which give information about the conditions of use of the platform and of the data hosted. The invited speakers and participants agreed that proper citation is a crucial open science practice because producing and sharing data requires considerable effort. The acknowledgment of a resource's production is a necessary symbolic reward and an important incentive for researchers and institutions to make data available. Evidently, the cultural aspects of data-sharing need to be negotiated and developed alongside the more technical aspects of open science. In this context, Etienne advocated better training of young researchers and students, and the establishment of best practices.
Nina Profazi & Johanna Miecznikowski
Presentations
Etienne, C. (2024, January 15-16). Learning to interact in French as a foreign language based on oral corpora and research findings [Workshop presentation]. Database evolution for the study of social interaction: Designing annotations for long-term usability, Neuchâtel, Switzerland. PDF (3.11 MB)
Etienne, C. (2024, January 15-16). Articulating quantitative and qualitative analyses in CLAPI [Workshop presentation]. Database evolution for the study of social interaction: Designing annotations for long-term usability, Neuchâtel, Switzerland. PDF (1.41 MB)
Wagner, J. (2024, January 15-16). The usability of CA corpora on the web. How is it done, for what purposes are the data used, and who is using the data from where? [Workshop presentation]. Database evolution for the study of social interaction: Designing annotations for long-term usability, Neuchâtel, Switzerland. PPTX (8.79 MB)
Video documentation
References
Heinemann, T., & Fox, B. (2019). Dropping Off or Picking Up? Professionals’ Use of Objects as a Resource for Determining the Purpose of a Customer Encounter. In D. Day, & J. Wagner (Eds.), Objects, Bodies and Work Practice (pp. 143-163). Multilingual Matters. https://doi.org/10.21832/9781788924535-009
Jefferson, G. (2004). Glossary of transcript symbols with an introduction. In G. H. Lerner (Ed.), Conversation Analysis: Studies from the First Generation (pp. 13-31). John Benjamins. https://doi.org/10.1075/pbns.125.02jef
Mortensen, K., & Wagner, J. (2021). Cykling som kropbaseret computationel praksis. In N. Bonderup Dohn, R. Mitchell & R. Chongtay (Eds.), Computational Thinking. Teoretiske, empiriske og didaktiske perspektiver (pp. 81-106). Samfundslitteratur.
Data banks
CABank: https://ca.talkbank.org/
CLAPI, Corpus de LAngue Parlée en Interaction: http://clapi.ish-lyon.cnrs.fr/V3_Accueil_Corpus.php?interface_langue=EN
CLAPI-FFL, Corpus de LAngue Parlée en Interaction - Français Langue Etrangère: http://clapi.icar.cnrs.fr/FLE/
CORAIL, Corpus ORaux Apprenant Interaction Linguistique: http://clapi.icar.cnrs.fr/Corail/
INTERFARE, INTERagir plus Facilement en REunion: https://icar.cnrs.fr/interfare/
Samtale: https://samtale.talkbank.org/
TalkBank: https://www.talkbank.org/
Websites
Samtalegrammatik. https://samtalegrammatik.dk/en
TEI: Text Encoding Initiative. https://tei-c.org/