Getting the best results from Auto-Transcription

Using Telestream Cloud for auto-transcription of media files is a great way to get started on a caption or subtitle project.

All submissions result in an automatically generated transcript with timing that matches your media file. The timed transcript can be reviewed and edited on the fly in the Telestream Cloud console or populated directly into your MacCaption or CaptionMaker project. The more accurate the results, the less cleanup and editing is needed to make your transcript perfect. Below are some best practices and tips that can help increase the accuracy of the auto-generated transcript from Timed Text Speech.

Isolating spoken word

For original content or media with multichannel audio, the dialogue-only track can be isolated to eliminate noise, music, and sound effects. This isolation can be done with any video editing software or audio production tool. Submitting the spoken word alone to the Timed Text Speech engine can greatly increase the accuracy of the auto-generated transcripts.

To do this, a video editor can open the project in Adobe Premiere or Avid and silence all audio tracks that do not contain dialogue. Next, they can export an audio-only file. Timed Text Speech can handle audio files such as .mp3, .aiff, and .wav.
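When the dialogue-only mix has already been exported with picture, the audio can also be separated outside the editing system. The following is a minimal sketch that assumes the ffmpeg command-line tool is installed; ffmpeg is not part of Telestream Cloud, and the file names are hypothetical. It only builds and prints the command so it can be reviewed before it is run.

```python
import shlex


def audio_only_command(source, target):
    """Build an ffmpeg command that drops video and writes 16-bit PCM audio."""
    return [
        "ffmpeg",
        "-i", source,         # file exported from the editing system (hypothetical name)
        "-vn",                # discard any video stream
        "-c:a", "pcm_s16le",  # uncompressed 16-bit PCM, suitable for a .wav file
        target,
    ]


cmd = audio_only_command("dialogue_mix.mov", "dialogue_only.wav")
print(shlex.join(cmd))
```

The printed command can be pasted into a terminal, or the list can be passed to subprocess.run once the file names are correct.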

In some cases, a media file such as a .mov or .mxf may contain multiple audio tracks. This could be a 5.1 mix or isolated tracks required for archival or transcode purposes. Within MacCaption or CaptionMaker, users can select any of the audio tracks within the video file before submitting their project to Timed Text Speech. By default, the software submits tracks 1 and 2. If there is an isolated audio track in the .mov or .mxf file, the software can be set to submit that alternate audio track instead. This means that the spoken-word-only track will be processed, and the results will in turn be more accurate.
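For users who submit files without going through MacCaption or CaptionMaker, a similar track selection can be sketched with ffmpeg. This is an illustration under assumptions, not the method the captioning software uses: it assumes ffmpeg is installed, that each track is stored as a separate audio stream in the file, and that the file names are hypothetical.

```python
def extract_track_command(source, track_number, target):
    """Build an ffmpeg command that extracts a single audio track.

    track_number is 1-based, the way tracks are usually numbered in
    editing software; ffmpeg counts audio streams from 0.
    """
    if track_number < 1:
        raise ValueError("track_number starts at 1")
    stream_index = track_number - 1
    return [
        "ffmpeg",
        "-i", source,
        "-map", f"0:a:{stream_index}",  # select one audio stream from the first input
        "-c:a", "pcm_s16le",            # write uncompressed 16-bit PCM
        target,
    ]


# Example: the isolated dialogue is assumed to be on track 3.
track_cmd = extract_track_command("master.mxf", 3, "dialogue_track.wav")
print(" ".join(track_cmd))
```

If the dialogue is stored as one channel inside a multichannel stream rather than as its own stream, this mapping does not apply and a channel-level filter would be needed instead.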

Training speech engine via Cloud web console

In many cases, media files that require transcription contain names, phrases, and acronyms that are not common. The speech engine may consistently get these wrong, forcing users to manually correct the results again and again. To remedy this, Telestream Cloud’s console offers a way to train the Timed Text Speech engine by uploading a corpus text file.

This file is a simple plain text .txt document that contains a list of names and phrases that are used in a project. Users can upload this .txt document to any specific project that may require training to increase accuracy. We recommend that the .txt document contain a list of phrases instead of words.

For example, an effective corpus text document would look like this:

John Galveston
CEO of the Company
Working with CDN providers
Transcoding and captioning solutions
MacCaption Software

A corpus text file that lists the same vocabulary only as isolated single words, without the surrounding phrases, is not effective.

When phrases are used, the speech engine knows what to expect and which other words typically accompany the new vocabulary. As a result, the accuracy of results for custom vocabulary increases.
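Because phrases work better than single words, a corpus file can be checked before it is uploaded. The sketch below is a hypothetical helper, not a Telestream Cloud feature; it flags lines that contain only one word so they can be expanded into phrases.

```python
def single_word_lines(corpus_text):
    """Return the non-empty corpus lines that contain only one word."""
    flagged = []
    for line in corpus_text.splitlines():
        words = line.split()
        if len(words) == 1:
            flagged.append(words[0])
    return flagged


corpus = """John Galveston
CEO of the Company
MacCaption
Transcoding and captioning solutions
"""

print(single_word_lines(corpus))  # prints ['MacCaption']
```

Here the lone entry "MacCaption" would be better written as a phrase such as "MacCaption Software".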

In some cases, creating a corpus text file for training is easy and takes very little time. Some users simply repurpose old transcripts or caption files from the same TV program or project. For example, if a broadcaster needs to create a transcript for season 3 of a TV program, they can open the caption files from seasons 1 and 2 using MacCaption or CaptionMaker and export a corpus .txt file that can be used for training Timed Text Speech. These two previous seasons contain the names and phrases that would greatly increase the vocabulary.
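Where earlier captions are already available as SubRip (.srt) files, the same repurposing idea can be sketched in a few lines. This is an assumption-based illustration and not the MacCaption or CaptionMaker export: it expects simple .srt content and keeps each distinct caption line once.

```python
def srt_to_corpus(srt_text):
    """Collect unique caption text lines from SubRip content, in order."""
    seen = set()
    phrases = []
    for raw in srt_text.splitlines():
        line = raw.strip()
        # Skip blank lines, cue numbers, and timing lines.
        # Note: a caption line made only of digits is skipped as well.
        if not line or line.isdigit() or "-->" in line:
            continue
        if line not in seen:
            seen.add(line)
            phrases.append(line)
    return "\n".join(phrases)


sample = """1
00:00:01,000 --> 00:00:03,000
Welcome back to the show

2
00:00:03,500 --> 00:00:05,000
I'm John Galveston

3
00:00:05,500 --> 00:00:07,000
Welcome back to the show
"""

print(srt_to_corpus(sample))
```

The returned text can be saved as a plain .txt file and reviewed before it is uploaded as a corpus.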

Users can also leverage the vocabulary training of Timed Text Speech when a rough transcript of the media file is already available before submission to Telestream Cloud. This rough transcript will also contain the names and key phrases for the project. Timed Text Speech would then automatically time the rough transcript and fill in the text that is missing.

Content type that is best suited for ASR

The type of video content plays an important part in the level of accuracy achieved with auto-transcription software. For example, a news show with a professional announcer and clear studio audio will yield much higher accuracy than a video shot on a mobile phone outdoors in a noisy environment. In addition, loud music, singing, and shouting will bring down the level of accuracy. There are also cases where speakers change their voice to provide dialogue for children’s programming or for dramatic effect. A speech engine that is designed and trained for standard voices may not be able to understand these voice tones. Generally speaking, projects with studio-quality recordings and professional speakers produce the best accuracy.

Creating a proxy

Professional video companies generally work with high-quality video master files called mezzanine files. These files are used the same way tape masters were used in the old days. In many workflows, the original video must be uncompressed or high bitrate when submitted for processing. This is not the case for Timed Text Speech workflows. Because Telestream Cloud requires only the audio for auto-transcription, users can submit a low-bitrate MP4 or just the audio file. As long as the audio quality is good, the video quality or resolution is not relevant for speech-to-text.
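One possible way to create such a proxy is sketched below. It assumes ffmpeg is installed with the libx264 and AAC encoders; the file names, resolution, and quality values are illustrative assumptions, not Telestream recommendations. The video is compressed heavily while the audio is kept at a comfortable bitrate.

```python
def proxy_command(source, target, audio_bitrate="128k"):
    """Build an ffmpeg command for a small MP4 proxy with good audio."""
    return [
        "ffmpeg",
        "-i", source,
        "-vf", "scale=-2:360",  # shrink to 360 lines, keep aspect ratio, even width
        "-c:v", "libx264",
        "-crf", "30",           # strong video compression; picture quality is not needed
        "-c:a", "aac",
        "-b:a", audio_bitrate,  # audio quality is what matters for speech-to-text
        target,
    ]


proxy_cmd = proxy_command("mezzanine_master.mxf", "proxy.mp4")
print(" ".join(proxy_cmd))
```

Because only the audio is used for transcription, the video settings can be lowered further if upload size matters.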

Voice over for the purpose of ASR

For video editing workflows, it’s quite common for editors to record a rough voice over while editing, prior to bringing in voice talent to record the final audio in the studio. This also gives video editors an opportunity to re-speak any portions of the video project that do not have clear audio. The rough voice over can then be exported from the video editing system and submitted to Timed Text Speech for processing. The results will be more accurate than those produced from the original audio.