375 hours of Danish dialect recordings released: Empowering Danish speech technology

Rasmus Hørving-Mulberg
September 16, 2024


Over the past two years, Danes from all over the country have donated their voices to a new speech dataset intended to improve Danish speech technology. Speech technology is growing globally and can improve voice-activated assistive technology and help streamline routine tasks such as note-taking.

Speech technology needs large datasets to work well, and Danish has lagged behind because it is a small language area. The Alexandra Institute, in collaboration with several partners, has collected around 375 hours of Danish speech – part of a larger ambition to create a dataset of 1,000 hours. The goal is to make it the largest Danish speech dataset to date, with a broad representation across gender, age and the many different dialects and accents in Denmark.

“One of the unique aspects of the dataset is that it has a broad representation of the entire country,” says Dan Saattrup Nielsen, Senior AI Specialist at the Alexandra Institute, in a press release.

The dataset can be used for many purposes, including transcription and hearing aid development.

Minimize bias in datasets

Previous datasets have been relatively small and dominated by young urban males, which has reduced the accuracy of speech recognition for people who speak a dialect, are older or are of a different gender.

“This means that the models trained on the dataset will be much better able to handle the different ways we speak out in the countryside, thus minimizing the bias of existing datasets,” explains Dan Saattrup Nielsen.

This will improve technologies like voicebots in customer service and automated note-taking in healthcare. Businesses will also benefit from more accurate automated meeting minutes. As part of the project, the Alexandra Institute has also developed a test dataset that makes it possible to test the accuracy of existing speech recognition systems from Google and Microsoft across different factors such as gender, age and dialects.

“With it, you can test exactly how good those systems are. It can help companies or the public sector make better decisions about which system to use,” says Dan Saattrup Nielsen.
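As a rough illustration of what such a test involves, the sketch below computes the word error rate (WER) of a speech recognition system separately for each group of speakers. It is a minimal sketch, not the project's own evaluation code: the field names (`reference`, `hypothesis`, `dialect`, `gender`), the group labels and the three toy records are all hypothetical and made up for the example.

```python
from collections import defaultdict


def word_edit_distance(ref_words, hyp_words):
    """Levenshtein distance between two word lists (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer_by_group(samples, key):
    """Word error rate per value of `key` (e.g. dialect, gender or age group)."""
    errors = defaultdict(int)
    words = defaultdict(int)
    for sample in samples:
        ref = sample["reference"].lower().split()
        hyp = sample["hypothesis"].lower().split()
        errors[sample[key]] += word_edit_distance(ref, hyp)
        words[sample[key]] += len(ref)
    return {group: errors[group] / words[group] for group in words if words[group] > 0}


# Hypothetical toy records: the reference is the correct transcript,
# the hypothesis is what a speech recognition system returned.
samples = [
    {"reference": "jeg bor i aarhus", "hypothesis": "jeg bor i aarhus",
     "dialect": "A", "gender": "female"},
    {"reference": "vi ses i morgen", "hypothesis": "vi ser i morgen",
     "dialect": "B", "gender": "male"},
    {"reference": "det regner i dag", "hypothesis": "det regner dag",
     "dialect": "B", "gender": "female"},
]

print(wer_by_group(samples, "dialect"))  # {'A': 0.0, 'B': 0.25}
print(wer_by_group(samples, "gender"))   # {'female': 0.125, 'male': 0.25}
```

A lower WER means fewer transcription errors, so a large gap between groups would point to the kind of bias the dataset is meant to reduce.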

The dataset now released is the first part of the project. During the fall, a second part will be released with two-person conversational data that reflects more natural conversations. The project aims to release up to 1,000 hours of data within the next year, which will include both read-aloud speech and conversation.

Facts about CoRal

CoRal is an initiative that has recorded the dialects and accents of more than 2,000 Danes to create a comprehensive speech dataset.

The goal is to have a dataset with over 1,000 hours of Danish speech, representing all age groups, genders and regional variations.

The project is a collaboration between the Alexandra Institute, the Department of Computer Science at the University of Copenhagen, Alvenir, Corti and the Danish Agency for Digitization and has a total budget of DKK 22 million, of which DKK 14 million comes from Innovation Fund Denmark.

The dataset can be downloaded here.
