A Starter Guide to Video Transcription for Editors

Tools and Tips for Transcribing Video

Whether it’s a how-to video on YouTube, a short corporate training segment, or a feature-length documentary, producing video content is a major undertaking with lots of moving parts. As a video editor, you play a crucial role in the post-production process and what the consumer ultimately sees. Ideally you are devoting most of your time to the artistic process of determining what to include versus cut, but it can be easy to get lost in technical processes and details.

One key way of both streamlining your editing process and expanding your viewership is by leveraging video transcriptions. In this guide, we will explore why video transcriptions should be a fundamental component of your work, how they can be used, and what tools are available to create them.

What is Video Transcription?

Video transcription is the process of converting video content into a text format. Video transcriptions can then be used for a variety of purposes such as captioning video for a hearing-impaired audience, subtitling video with different languages, streamlining the post-production editing process, or creating text-only versions of video that are more easily indexed and therefore more discoverable online for search engine optimization (SEO) purposes.

Different Types of Video Transcriptions

  • Captioning

Captioning is typically used to provide deaf or hearing-impaired communities as rich a viewing experience as possible. Video captioning assumes that viewers cannot hear the audio. Sometimes called same-language subtitling, video captioning denotes not only dialogue but also other relevant audio content such as soundtracks and background noise in text.

Text usually appears white in a black box at the base of the screen, and non-dialogue content is typically presented in brackets (e.g., [knocking on door], [violin music begins]). Captions are time-synchronized so that the audience can read the text as that same content is being spoken on video.

Video captions may be closed, open, or live. Closed captioning refers to captions that viewers can choose to turn on or off. In contrast, open captioning is always visible and cannot be turned off by viewers. Live captioning occurs during news, sporting events, or other live broadcasts. A stenographer listens to the broadcast and types what he or she hears into a specialized device and computer program so that captions appear just seconds after something is spoken.

  • Subtitling

Subtitles assume that the audience can hear the audio content but does not understand the dialogue because it is in an unfamiliar language. Subtitles translate dialogue content into a different language but don’t include descriptions of background noise, music, or other audio cues. For instance, an English speaker viewing a French movie on Netflix can turn on subtitles to read all the dialogue in English. Like captions, subtitles can be either closed (i.e., optional) or open (i.e., permanent).

Differences between Subtitling and Captioning

To summarize, captioning assists audiences who are deaf, hard-of-hearing, or who must mute a video’s audio content; subtitling translates video content into a viewer’s native or preferred language. Because of this, audio cues and background noises are denoted in brackets for captions but these are omitted for subtitles.

Subtitles also tend to have greater flexibility with fonts, colors, and positioning than captions do. While white text with a black rim or shadow is most common for subtitles, this can be altered. Similarly, subtitles usually sit in the lower portion of the screen, but they can be repositioned more easily than captions.

Alternate Uses of Video Transcriptions

While captions and subtitles are the most common application of video transcriptions, this text content can also be used outside of video editing. People will often transcribe videos to improve the searchability of their video content online. Because search engines primarily index text rather than audio or visual content, publishing a transcript alongside a video on a site helps potential viewers discover the content and makes it more accessible.

Uses of Video Transcriptions

As a video editor, your work with video transcriptions can span a wide variety of sectors and specialties. Here are some of the most common uses of transcribed video:

  • Entertainment - Transcribing videos and creating captions and subtitles improves the distribution and reach of movies, documentaries, TV shows, live sporting events, awards ceremonies, and other entertainment content.
  • Education - Video transcripts and captions make educational materials deaf-friendly and more accessible to hearing-impaired communities. Education content includes lectures, how-to or training videos, webinars, interviews, and other interactive materials.
  • Sermons - Including captions or subtitles with online videos of sermons makes content accessible to a much wider audience.
  • International shows and movies - Independent filmmakers rely on subtitles and captions when submitting their films to international film festivals such as Sundance or Cannes.
  • Repurposing video content - Video transcriptions can be repurposed for other uses such as writing articles, how-to guides, study guides, product descriptions, or as a foundation for other written content.

Advantages of Getting Videos Transcribed

  • More efficient post-production editing process

As a video editor, you typically have to condense a large amount of raw footage into featured content that is much shorter. Video transcriptions are one way of making this process more efficient. Transcripts help you locate specific scenes or soundbites and facilitate paper edits.

A paper edit is a time-coded list of the segments you want to incorporate in the order you plan on using them. This list can be paired with notes on associated footage you plan on including (e.g., B-roll footage of interviewee eating at a restaurant). Creating a good paper edit can be a major challenge, especially if you are juggling large quantities of interview footage. Accurate transcriptions make it easier to scan through, highlight, edit, and re-order content during paper edits.
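
As a rough illustration of the paragraph above, a time-stamped transcript can be searched for soundbites and the chosen segments listed in playback order. This is a minimal Python sketch, not a feature of any particular transcription tool: the `Segment` structure, the function names, and the sample interview are all hypothetical and invented for this example.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float  # seconds from the start of the clip
    end: float
    speaker: str
    text: str


def to_timecode(seconds: float) -> str:
    """Format seconds as HH:MM:SS for a paper-edit listing."""
    total = int(seconds)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def find_segments(transcript: list[Segment], keyword: str) -> list[Segment]:
    """Return segments whose text mentions the keyword (case-insensitive)."""
    needle = keyword.lower()
    return [seg for seg in transcript if needle in seg.text.lower()]


def paper_edit(selected: list[Segment]) -> str:
    """List the chosen segments in the order they will be used."""
    lines = []
    for position, seg in enumerate(selected, start=1):
        lines.append(
            f"{position}. {to_timecode(seg.start)}-{to_timecode(seg.end)} "
            f"{seg.speaker}: {seg.text}"
        )
    return "\n".join(lines)


# Hypothetical interview transcript, invented for illustration.
transcript = [
    Segment(12.0, 19.5, "Chef", "I started cooking when I was nine."),
    Segment(75.0, 88.0, "Chef", "The restaurant opened in the spring."),
    Segment(3725.0, 3741.0, "Chef", "Cooking is how I say thank you."),
]

# Pull every soundbite about cooking, then lead with the latest one.
hits = find_segments(transcript, "cooking")
print(paper_edit([hits[1], hits[0]]))
```

Keeping the original start and end times on each segment means the re-ordered list still points back to the exact spot in the raw footage.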

  • Section 508 and ADA compliant

Improving accessibility to videos for people with disabilities via captioning is not only business-savvy and the right thing to do, it can also be a legal requirement. Section 508 of the Rehabilitation Act requires that electronic content developed, procured, maintained, or used by federal agencies be accessible to people with disabilities, and the Americans with Disabilities Act (ADA) prohibits discrimination on the basis of disability more broadly. By creating captions with video transcriptions, you help ensure 508 compliance for the hearing-impaired and deaf community.

  • Better search engine visibility

Video transcripts also play a major role in search engine optimization (SEO) because search engines primarily index text rather than the audio or video itself. By transforming video to text, you improve its searchability. For instance, academics can transcribe conference presentations to increase exposure to their findings. Webinars, vlogs, speeches, sermons, and how-to videos are just some of the other source materials that gain SEO benefits from video transcription.

  • Better social media visibility

Social networks such as Facebook play videos without sound by default. Using video transcripts to create captions increases these embedded videos’ visibility, particularly when people view them in locations such as airports or hospitals where full volume viewing would be disruptive.

  • Increased viewership

By improving both accessibility and visibility for videos, you ultimately increase total viewership.

Elements of Video Transcripts

  • Timecodes/timestamps - Video producers rely on timecodes to synchronize various components of their work, such as shots taken from multiple cameras or audio that is recorded separately from video. Timecodes also help editors reference particular frames or scenes more easily. Timestamps are embedded in transcripts; readers can click on the timestamp and immediately refer to the corresponding video content. By pairing timecodes with timestamps, captions and subtitles are time-synchronized so that the audience can read content while the text’s corresponding images are on-screen.
  • Audio descriptions - Captions also require that all audio content is described. While automated transcription options may work for dialogue-only video, human transcriptionists are more effective at including descriptive audio content such as [gust of wind and windows rattling] or [cheering crowd]. Audio descriptions must be succinct but also descriptive enough to convey the video’s original intent and atmosphere.
  • Use of punctuation - Another area that requires either manual editing or human transcription is punctuation. The placement of punctuation may alter a caption or subtitle’s meaning, and automated transcriptions do not address these subtleties very effectively. For example, “We’re going to learn to draw kids!” has a different meaning than “We’re going to learn to draw, kids!” Accurately conveying pauses, tone, and meaning requires effective use of punctuation.
  • Timing - Timing captions goes beyond the use of timecodes and timestamps. As a video editor, you must not only be sure that viewers cannot read ahead of what they are seeing, you must also time audio descriptions precisely. For instance, letting caption readers see [gunshot] before other viewers hear the sound may ruin the director’s intended experience.
  • File types - As a video editor, you may encounter a variety of file types when using transcripts. File types vary on multiple levels:
    • Files may be binary—only readable by computers or specialized hardware—or they may be readable text. Some may incorporate elements of both.
    • Files may be standardized so that they are accessible to anyone or proprietary and require one manufacturer’s suite of tools.
    • Files may have a simple default (i.e., white text centered at the bottom of the screen) or they may allow for a wide variety of customization. Some commonly used open formats include TTML, STL, SRT, and IMSC. Which file types you ultimately use will depend largely on the nature of the job (e.g., broadcasting, online clips, film, depositions).
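
To make the timing and file-type points above concrete, here is a minimal Python sketch that writes cues in the SRT layout: a running number, a start and end timestamp in HH:MM:SS,mmm form joined by an arrow, the caption text, and a blank line. The function names and sample cues are hypothetical, and the sketch covers only plain SRT, not the styling or positioning options of richer formats such as TTML.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as the HH:MM:SS,mmm timestamp used in SRT files."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def to_srt(cues: list[tuple[float, float, str]]) -> str:
    """Build SRT text from (start, end, text) cues, sorted by start time."""
    blocks = []
    for number, (start, end, text) in enumerate(sorted(cues), start=1):
        if end <= start:
            raise ValueError(f"cue {number} must end after it starts")
        blocks.append(
            f"{number}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n"
        )
    return "\n".join(blocks)


# Hypothetical cues; the sound description starts when the sound does,
# so readers do not see [gunshot] before the shot is heard.
cues = [
    (4.2, 5.0, "[gunshot]"),
    (1.0, 3.5, "We're going to learn to draw, kids!"),
]
print(to_srt(cues))
```

Because the output is readable text, a file produced this way can be opened and corrected in any text editor before it is loaded into an editing or publishing tool.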

Tools for Video Transcription

Depending on the nature of your video editing project, you may use different types of tools to generate video transcriptions. The most basic options use speech recognition software to automatically transcribe content while the most sophisticated rely more heavily on human transcribers.

When doing small-scale projects, you may be able to tap into existing voice recognition software for free using your phone or computer. For instance, select “voice typing” on Google Docs on your computer while playing the video. Alternatively, use the microphone on a word processing app on your phone to transcribe the recording while it plays.

Automatic transcription can also be done using paid versions of software programs (e.g., Adobe’s Premiere Pro, InqScribe). These can be purchased and downloaded onto your computer. Alternatively, you can upload your files to a web-based service (e.g., Trint, Rev) that uses AI-based automated transcription. These services’ rates will vary depending on your content and the additional features you want (e.g., human editing of automated transcripts).

While it is possible to make straightforward transcripts from video content with applications like these, the transcripts they generate are difficult to use as captions or subtitles without timestamps. If you’re publishing videos to a platform like YouTube, captions and subtitles will be generated automatically and you won’t have to transcribe the audio first. You will almost certainly have to edit them, however.

All these forms of speech recognition software share similar drawbacks: background noise, jargon, dialects and accents, slurring, mispronunciation, or mumbled words can all negatively affect accuracy. These tools almost always require manual editing to ensure there is no misrepresentation of the video. While they may be suitable for a simple how-to video, they will be frustrating to use for something like a documentary.

Depending on the length and complexity of the video you are editing, manually editing an automated transcript may become so time-intensive that using video transcription services makes more sense. Subtitling services and human transcriptionists can provide much greater accuracy and offer more advanced tools to properly sync text with the corresponding visual content.

For instance, TranscriptionWing™ offers a 3-stage proofing process to ensure both the accuracy of the transcript and the precision of time-synchronization. Timestamps are typically included and human transcriptionists are far better at differentiating speakers, understanding subtle changes in accents, and describing important audio cues for captions. They can also offer greater customization and more easily accommodate different file types.

Common Guidelines for Using Video Transcripts


Creating accurate and effective subtitles and captions may sound straightforward, but the process is often more art than science. Captions must parallel a viewing experience with sound while remaining short enough to be readable. Subtitles and captions must match the timing of the dialogue while being placed without blocking important imagery. Following are some core guidelines for using transcripts for captions and subtitles:

  • Succinctness - Keep your captions and subtitles short, using no more than three lines of text.
  • Consistency - Maintain consistency with the videographer’s intent with regard to both meaning and style. For example, write words exactly as they are spoken; don’t change “yeah” to “yes” in a caption. Also, take account of important stylistic choices, noting speakers’ accents, music lyrics, and tone.
  • Differentiation - Choose a style of speaker differentiation that makes it easy for viewers to follow dialogue. Alter the color of text for each speaker or label speakers. Be sure to note if someone off-screen is speaking so that a scene is clearly understood.
  • Positioning - Make sure your subtitles and captions do not obscure other visual information. For instance, additional information often appears at the base of the screen during news broadcasts; you may have to move your live captioning to the top of the screen. Positioning captions may also become relevant when assisting viewers with speaker differentiation. Sometimes captions can be positioned next to the person who is speaking rather than using labels.
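
The succinctness guideline above can be checked mechanically. The following Python sketch wraps a transcript passage and groups the lines into captions of no more than three lines each. The three-line limit comes from the guideline; the 37-character line width is only an assumed value for illustration and should be replaced with whatever your style guide or platform specifies. The function name and sample text are hypothetical.

```python
import textwrap

MAX_LINES = 3   # limit suggested in the succinctness guideline
MAX_CHARS = 37  # assumed line width, not from this guide; adjust as needed


def split_into_captions(text: str) -> list[list[str]]:
    """Wrap text to MAX_CHARS and group lines into captions of at most MAX_LINES."""
    lines = textwrap.wrap(text, width=MAX_CHARS)
    return [lines[i:i + MAX_LINES] for i in range(0, len(lines), MAX_LINES)]


# Hypothetical transcript passage, invented for illustration.
passage = (
    "As a video editor you typically have to condense a large amount of "
    "raw footage into featured content that is much shorter and accurate "
    "transcripts make it easier to scan through and reorder that content."
)

for caption in split_into_captions(passage):
    print("\n".join(caption))
    print()  # blank line between captions
```

A split like this only handles length. Deciding where a caption should break so that it matches the speaker's phrasing and the timing of the dialogue still calls for an editor's judgment.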


If you’re not already using video transcriptions during your editing process or to create captions and subtitles, then you’re missing a major source of time savings and a means of adding value to your work. Avoid headaches during paper edits, increase viewership of your work by expanding its accessibility, and preserve more time for your artistic eye rather than irritating technicalities. Video transcriptions are just a small component of your work, but they will move you one step closer to seeing the forest for the trees.
