Introducing - Transcribing news episodes from Sveriges Radio

By Peter Örneholm | Blog | 25 March 2020

Radiotext is a site that transcribes news episodes from Swedish Radio and makes them accessible. It uses multiple AI-based services from Azure Cognitive Services, like Speech-to-Text, Text Analytics, Translation, and Text-to-Speech.

By combining these services, you can listen to “Ekot” from Swedish Radio in English :) Disclaimer: the site is primarily a technical demo and should be treated as such. - Screenshot of list


To give (especially non-Swedish) readers some background: Sveriges Radio (Swedish Radio) is the public service radio broadcaster in Sweden, much like the BBC in the UK. Swedish Radio does produce some shows in languages like English, Finnish, and Arabic, but the majority is (for natural reasons) produced in Swedish.

The main news show is called Ekot (“The Echo”). It broadcasts at least once every hour, with episodes ranging from 1 minute to 50 minutes. The spoken language of Ekot is Swedish.

For some time, I’ve been wanting to build a public demo with the AI services in Azure Cognitive Services, but as always with AI, you need some data to work with. It just so happens that Sveriges Radio has an open API with access to all of their publicly available data, including the audio archive - enabling me to work with the speech APIs.


The site runs in Azure and is heavily dependent on Cognitive Services. It’s split into two parts: Collect & Analyze and Present & Read. - Architecture

Collect & Analyze

The Collect & Analyze part is a series of actions that collect, transcribe, analyze, and store information about the episodes.

It’s built using .NET Core 3.1 and can be hosted as an Azure Function, a container, or anything else that can run continuously or on a set interval.
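Conceptually, the host is just a loop that runs one collect-and-analyze pass on a fixed interval. Here is a minimal sketch in Python for illustration (the real app is .NET Core, and `run_pipeline_once` is a placeholder, not the actual implementation):

```python
import time

def run_pipeline_once():
    """One collect & analyze pass (placeholder for the real steps:
    fetch new episodes, transcribe, translate, analyze, store)."""
    pass

def run_on_interval(interval_seconds=600, max_runs=None):
    """Run the pass on a fixed interval, the way a timer-triggered
    Azure Function or an always-on container would.

    max_runs only exists so the sketch can terminate; a real host
    would loop until stopped."""
    runs = 0
    while max_runs is None or runs < max_runs:
        run_pipeline_once()
        runs += 1
        time.sleep(interval_seconds)
    return runs
```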

  1. The application periodically looks for new episodes of Ekot using the Sveriges Radio open API. There is a NuGet package available that wraps the API for .NET (disclaimer: I’m the author of that package…). Once a new episode arrives, it caches the relevant data in Cosmos DB and the media in Blob Storage.

    Screenshot of SR API episode details result

    The reason for caching the media is that the batch version of Speech-to-Text requires the media to be in Blob Storage.

  2. Once all data is available locally, it starts an asynchronous transcription using the Cognitive Services Speech-to-Text API. It specifically uses batch transcription, which supports transcribing longer audio files. Note that the default speech recognition only supports about 15 seconds of audio, because it is (as I’ve understood it) targeted more towards understanding “commands”.

    The raw result of the transcription is stored in Blob Storage, and the most relevant information is stored in Cosmos DB.

    The transcription contains both the combined result (one long string of all the text) and the individual words with timestamps. A sample of such a file can be found below:

    Transcription result

    This site only uses the combined result, but the user experience could be improved by utilizing the word-level data.

  3. All of the texts (title, description, transcription) are translated into English and Swedish (if those were not the original language of the audio) using Cognitive Services Translator Text API.


  4. All texts mentioned above are analyzed using the Cognitive Services Text Analytics API, which provides sentiment analysis, key phrases and (most importantly) named entities. Named entities are a great way to filter and search the episodes, and better than plain keywords, since each entity carries not only the word itself but also its category. The result is stored in Cosmos DB.

    Keyphrases and Entities

  5. The translated transcriptions are then converted back into audio using Cognitive Services Text-to-Speech, producing one version for English and one for Swedish. For English, there is support for a Neural Voice, and I’m impressed by the quality: it’s almost indistinguishable from a human. The Swedish voice is fine, but you will hear that it’s computer-generated. The generated audio is stored in Blob Storage.

  6. Last but not least, a summary of the most relevant data from the previous steps is denormalized and stored in Cosmos DB (using the Table API).
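The six steps above can be sketched as one function per episode. This is an illustrative Python sketch only (the real application is .NET Core); the `services` object and all of its method names are hypothetical stand-ins for the Azure SDK clients, not real APIs:

```python
def process_episode(episode, services):
    """Sketch of the six collect & analyze steps for one episode.
    `services` bundles hypothetical clients for storage, speech,
    translation, analytics, and synthesis."""
    # 1. Cache metadata in Cosmos DB and media in Blob Storage
    #    (batch Speech-to-Text reads audio from Blob Storage).
    services.store_metadata(episode)
    audio_url = services.cache_media(episode)
    # 2. Asynchronous batch transcription (supports long audio).
    transcript = services.transcribe(audio_url)
    # 3. Translate into English (the original is Swedish).
    texts = {"sv": transcript, "en": services.translate(transcript, to="en")}
    # 4. Text analytics: sentiment, key phrases, named entities.
    analysis = {lang: services.analyze(text) for lang, text in texts.items()}
    # 5. Synthesize the translated transcripts back into audio.
    audio = {lang: services.synthesize(text, lang) for lang, text in texts.items()}
    # 6. Denormalize a summary for the presentation site.
    summary = {"episode": episode, "texts": texts,
               "analysis": analysis, "audio": audio}
    services.store_summary(summary)
    return summary
```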

Present & Read

The site that presents the data (Radiotext) is built using ASP.NET Core 3.1 and is deployed as a Linux Docker container to Docker Hub and then released to an Azure App Service.

Currently, it lists all episodes and allows for in-memory filtering and search. From the listing, you can see the first part of the transcription in English and listen to the English audio. - Screenshot of list
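The in-memory filtering could look roughly like this sketch (Python for illustration; the real site is ASP.NET Core, and the field names here are assumptions, not the actual schema):

```python
def search_episodes(episodes, query=None, entity=None):
    """Case-insensitive in-memory filtering of denormalized episode
    summaries: optionally by a named entity, optionally by a free-text
    query over title and transcription."""
    query = (query or "").lower()
    results = []
    for ep in episodes:
        entities = [e.lower() for e in ep.get("entities", [])]
        if entity and entity.lower() not in entities:
            continue
        haystack = " ".join(
            [ep.get("title", ""), ep.get("transcription", "")]
        ).lower()
        if query and query not in haystack:
            continue
        results.append(ep)
    return results
```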

By entering the details page, you can explore the data in multiple languages as well as the original information from the API. - Screenshot of details

Immersive reader

Immersive Reader is a tool/service that has been available for some time as part of Office, for example in OneNote. It’s a great way to make reading and understanding texts easier. My wife works as a speech and language pathologist, and she says this tool is a great way to help people understand texts. I’ve incorporated the service into Radiotext to let users read the news with it.

Primarily, it can read the text for you, and highlight the words that are currently being read: Immersive Reader - Read

It can also explain certain words, using pictures: Immersive Reader - Picture

And if you are learning about grammar, it can show you grammar details, like which words are nouns, verbs, and adjectives: Immersive Reader - Grammar

I hadn’t used this service before, but it shows great potential for making texts more accessible. Combined with Speech-to-text, it can also make audio more accessible.


I’ve tried to get a grip on what it would cost to run this service, and I estimate that running all services for one episode of Ekot (5 minutes) costs roughly €0.20. That includes transcribing, translating, analyzing, and generating audio for multiple languages.

On top of that, there is a cost for running the web app, the analyzer, and the storage.

Ideas for improvement

The current application was built to showcase and explore a few services; it’s not in any way feature complete. Here are a few ideas off the top of my head.

  • Live audio transcription: Speech-to-Text supports live audio transcription, so we could transcribe the live radio feed. This could be combined with the subtitles idea below.
  • Improve accuracy with Custom Speech: Using Custom Speech, we could improve the accuracy of the transcriptions by training on common domain-specific words. For example, the jingle is often treated as a word, while it should not be.
  • Enable subtitles: Using the timestamp data from the transcription, subtitles could be generated. That would enable combining the original audio with subtitles.
  • Multiple voices: A natural part of a news episode is interviews, and interviews naturally involve multiple people. The audio I’m generating now reads all text with the same voice, so conversations sound kind of strange. Using conversation transcription, we could find out who says what and generate the audio with multiple voices.
  • Improve long audio: The current solution will fail when generating audio for long texts; the Long Audio API supports that scenario.
  • Handle long texts: Both translation and text analytics have limits on text length. At the moment, texts that are too long are cut off, but they could be split into multiple chunks, analyzed separately, and concatenated again.
  • Search using Azure Search: At the moment, the “search” and “filtering” functionality is done in memory, just for demo purposes. Azure Search would allow a much better search experience. Unfortunately, it does not currently support automatic indexing of the Cosmos DB Table API.
  • Custom Neural Voice: I’ve always wanted to be a newsreader, and using Custom Neural Voice I might be able to be one ;) Custom Neural Voice can be trained on your voice and used to generate the audio. But even if we could do this, it doesn’t mean we should. Custom Neural Voice is one of the few services (maybe the only one?) you need to apply for access to use. In a world of fake news, I would vote for not implementing this.


This is an unofficial site, not built or supported by Sveriges Radio. It’s based on the open data in their public API. It’s built as a demo showcasing some technical services.

Most of the information is automatically extracted and/or translated by AI in Azure Cognitive Services, based on the information provided by the Swedish Radio API. It is not verified by any human, and there will most likely be inaccuracies compared to the source.

All data is retrieved from the Swedish Radio Open API (Sveriges Radios Öppna API) and is Copyright © Sveriges Radio.

Try it out and contribute

The source code is available on GitHub and a Docker image is available on Docker Hub.

Hope you like it. Feel free to contribute :)
