October 8, 2020
There are several parts to our project, all important and all intertwined. To help make sense of our updates, I’ll tag each one by its most prominent theme: speech-to-text progress, community engagement, professional development, or LGBT collections. To kick things off, here is an update on how our speech-to-text processes have progressed and where they stand. The good news is we have made progress! And here’s how:
Our goal is to extract meaningful text transcripts from the hundreds of hours of archival AV materials that are part of our LGBT collections in the UWM Archives. This includes oral history interviews, cable access shows dedicated to LGBT life in Milwaukee that debuted in the 1980s and 1990s, local radio programs from the same era, and news segments from Milwaukee’s local mainstream news media. It’s important that the text we extract is accurate enough to enable meaningful research. That means that language used by and about the LGBT community is especially important to capture accurately. In a later post we’ll talk more about the issues raised by use of terms that are considered outdated and even offensive – an extremely important topic when we are considering how to make these data sets and the archival materials themselves publicly accessible. For now, we’ll concentrate on getting the text out in the first place:
The project is using an open-source speech-to-text engine, Mozilla’s DeepSpeech. It is “pre-trained” using the Mozilla Common Voice dataset. Because the code is open, the model can be further trained locally, either to make up for deficits in the Common Voice dataset or to fine-tune it (within reason) for a particular collection of AV. We’re using DeepSpeech for all of the above reasons, and in order to develop a model that will work especially effectively with the archival AV materials that exist across the LGBTQ+ collections in the UWM Archives.
Throughout the summer, Dan Siercks worked with DeepSpeech to augment the model, in order to raise our confidence that it would recognize the language that the Milwaukee LGBTQ community might use to describe themselves and their experiences. Cary Costello, our Disciplinary Scholar and UWM’s Director of LGBT Studies, created a “terminology list” split into contemporary and historic LGBTQ+ terms, with each group divided into tiers running from the most generic/mainstream terms to the narrowest ones used by specific subcommunities.
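For readers who think in code, here is one way a tiered list like this could be represented. It is only a sketch: the `Term` class, the `terms_for` helper, and the numeric tiers are assumptions made for illustration, and the entries are placeholders rather than anything from Cary’s actual list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Term:
    text: str
    era: str   # "contemporary" or "historic"
    tier: int  # assumed convention: 1 = most generic/mainstream, higher = narrower


# Placeholder entries only; the project's real terminology list is not reproduced here.
TERMINOLOGY = [
    Term("placeholder general term", "contemporary", 1),
    Term("placeholder narrow term", "contemporary", 3),
    Term("placeholder older term", "historic", 1),
]


def terms_for(era, max_tier):
    """Return the term texts for one era, up to and including max_tier."""
    return [t.text for t in TERMINOLOGY if t.era == era and t.tier <= max_tier]
```

Keeping era and tier as separate fields would make it easy to start with the broadest terms and widen the selection later, for example `terms_for("contemporary", 1)` first and `terms_for("contemporary", 3)` afterwards.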
Using a local test collection that included oral histories with human-created transcripts, Dan developed a script to identify terms from Cary’s list in those transcripts. We then manually identified the start and stop times of each occurrence (made easier because these transcripts have been coded using OHMS) and entered those times into a spreadsheet. Using Audacity, Dan created a simple process to locate and “snip” the matching audio segments from the recordings, producing a test data set that emphasizes the LGBTQ+ terms we want our model to identify more reliably. While this approach needs to be carefully calibrated in order to avoid overcorrecting the model, we have found that accuracy on our data sets has improved. We believe this is due to the term-specific augmentation, to having run multiple iterations of the model, and to a significant improvement in the model itself following an upgrade that we recently implemented locally.
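To give a rough sense of what these two steps involve, here is a small Python sketch. It is not Dan’s script, and it swaps Audacity for Python’s standard `wave` module, so treat it as an illustration only. The function names `find_terms` and `cut_clip` are hypothetical, and the sketch assumes plain-text transcripts and uncompressed WAV recordings.

```python
import re
import wave


def find_terms(transcript, terms):
    """Return (term, character_offset) pairs for whole-word, case-insensitive matches."""
    hits = []
    for term in terms:
        pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        hits.extend((term, m.start()) for m in pattern.finditer(transcript))
    # Sort by position so hits read in transcript order.
    return sorted(hits, key=lambda hit: hit[1])


def cut_clip(src_path, dst_path, start_s, stop_s):
    """Copy the audio between start_s and stop_s (in seconds) into a new WAV file."""
    with wave.open(src_path, "rb") as src:
        rate = src.getframerate()
        params = src.getparams()
        src.setpos(int(start_s * rate))  # jump to the first frame of the clip
        frames = src.readframes(int((stop_s - start_s) * rate))
    with wave.open(dst_path, "wb") as dst:
        dst.setparams(params)    # same channels, sample width, and sample rate
        dst.writeframes(frames)  # the frame count in the header is corrected on close
```

In a workflow like ours, each spreadsheet row would supply the `start_s` and `stop_s` values for one call to `cut_clip`, while `find_terms` only points a person to the passages worth timing by hand.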
We’ll report out with more specifics and examples soon!