A Beginner's Guide to Audio Transcriptions

What is audio transcription and how can audio be converted to text?

Has an Internet search ever led you to a podcast episode on exactly the topic you were researching? Have you successfully filed an insurance claim before? Has market research ever guided your company toward a successful re-branding concept? Even if these scenarios don’t ring true for you, the chances are high that you have benefitted from the quiet work of audio transcriptions in one way or another.

Like a computer programming language, audio transcriptions function behind the scenes, and the ways in which they facilitate work, learning, and documentation typically go unnoticed. They routinely help people improve workflow, increase access to content, simplify content-sharing, verify information, and save time. This guide introduces audio transcriptions, suggests when to use them, and examines the pros and cons of different variations. Feel free to read straight through or skip to the section that interests you most.

What is Audio Transcription?

Audio transcription is the conversion of an audio file into a text format or document. The audio file could be hard copy, such as on a disc, or it could be soft copy, such as a digital file. Files could be in .mp3, .wav, .mp4, .mov, .avi, or a host of other digital formats.

How do you transcribe audio?

Audio transcriptions are created in one of three ways. First, a human transcriber can listen to the audio files and type what is said. Transcribers may use technology such as software programs to slow down the speed of the audio, specialized headsets, and/or foot pedals to stop and start the audio. Tools like these improve the efficiency and accuracy of human transcriptionists’ work.

Alternatively, files can be transcribed automatically using a downloadable software program or a web-based AI platform. These systems vary in their level of sophistication and accuracy. Many require high-quality audio files and unaccented American English to achieve an accuracy of 80 percent or greater. Some programs can improve their accuracy through machine learning. With repeated exposure to a speaker’s particular accent and/or frequently used jargon, these programs can generate increasingly precise transcripts over time.
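
The article does not say how the 80 percent figure is measured. One common way to quantify transcript accuracy is the word error rate (WER): the number of word substitutions, deletions, and insertions needed to turn a transcript into a trusted reference, divided by the number of words in that reference. The Python sketch below is illustrative only; the function name and the sample sentences are assumptions and do not describe any particular product.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edits needed to turn the reference words seen so far
    # into the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from the transcript
                curr[j - 1] + 1,     # extra word in the transcript
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Made-up example: a trusted reference and an automated transcript of it.
reference = "we sent the reports on monday"
hypothesis = "we sent the report monday"
wer = word_error_rate(reference, hypothesis)
print(f"WER: {wer:.2f}, rough accuracy: {1 - wer:.0%}")
```

Treating accuracy as one minus WER is a simplification; services may count punctuation, speaker labels, or timestamps differently.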

Finally, a combination of the previous two methods can be used to create audio transcriptions. A software program or web-based AI platform can generate an initial transcript from an .mp3 or other audio file. This initial transcript can then be more thoroughly edited and refined by a human transcriber to achieve greater accuracy and capture more nuances.

How long does it take to transcribe audio?

Since we speak much faster than we can write or type, it is almost impossible to transcribe one hour of recording in just one hour. Unlike court reporters or stenographers who use stenotype machines to transcribe in shorthand, freelance or general transcriptionists who use the common keyboard can only type so fast even with the aid of foot pedals or hotkeys.

Generally speaking, experienced transcriptionists can transcribe an hour of audio or video in about 3 to 5 hours. However, many factors can lengthen or shorten the transcription time of an audio or video file, including:

  • Transcription style - Audio transcription can be done in standard (or non-verbatim), pure verbatim, or smart verbatim style (verbatim, but without fillers like uhms and ahs). Understandably, it is much more time-consuming to transcribe every single utterance, including false starts and fillers, than to produce a standard or non-verbatim transcript.
  • Quality of recording - The accuracy of a transcript depends heavily on the clarity of the recording. A professionally recorded studio interview is much easier to transcribe than a conversation recorded outside with a lot of background noise. If a recording is too noisy, it may take a transcriptionist multiple passes to decipher everything said by the speakers.
  • Number of speakers - The number of speakers contributes to the length of the discussion in a recording. Focus groups, roundtable meetings, or large conference calls can be more difficult and time-consuming to transcribe. It is also challenging to distinguish different voices and accurately identify who said what, especially when people in a group crosstalk or talk over each other.
  • Speaker traits - Speaker accent and talking speed can significantly affect the total transcription time. If a person has a heavy accent, it may be harder to decipher what they are saying. If the speaker talks fast, there are more words to transcribe per minute, which adds to the overall time.
  • Topic familiarity or expertise - Audio transcriptions are used for all kinds of purposes, including in highly specialized or technical fields. Recordings that are full of specialized or technical terminology can be more challenging when a transcriptionist is inexperienced or unfamiliar with the topic they are transcribing. It would require a lot of research and editing time for a transcriptionist to successfully produce an accurate transcript for technical or jargon-filled recordings.
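
As a rough illustration of the 3 to 5 hour rule of thumb above, the Python sketch below turns a recording length into an estimated range of working time. The function name is hypothetical, and the sketch deliberately ignores the factors just listed, since the article gives no numbers for them.

```python
def estimate_transcription_hours(audio_minutes: float,
                                 low_ratio: float = 3.0,
                                 high_ratio: float = 5.0) -> tuple[float, float]:
    """Return (low, high) hours of work for a recording of the given length.

    The default ratios reflect the rule of thumb of 3 to 5 hours of work per
    hour of audio; difficult recordings may fall outside this range.
    """
    if audio_minutes < 0:
        raise ValueError("audio_minutes must not be negative")
    audio_hours = audio_minutes / 60
    return audio_hours * low_ratio, audio_hours * high_ratio


low, high = estimate_transcription_hours(90)
print(f"A 90-minute recording: roughly {low:.1f} to {high:.1f} hours of work")
```

Noisy audio, many speakers, or a pure verbatim style would push a real estimate toward or beyond the upper end of the range.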

Uses of Audio Transcriptions

Courtroom proceedings typically require a more specialized type of transcription known as stenography. During live courtroom proceedings, licensed court reporters or stenographers transcribe everything that is spoken using a device called a stenotype. Typing in shorthand on the stenotype lets them keep pace with speech as it happens. A computer then converts the shorthand into words (i.e., legal transcripts).

Court transcription services require intense training and certification that is distinct from general transcription. Beyond the operation of a stenotype and an emphasis on legal terminology and comprehension, stenographers need a strong grasp of court policies and procedures. Many states accept a minimum level of certification from associations such as the National Court Reporters Association (NCRA), which issues a certification as a Registered Professional Reporter (RPR). Some states require additional licensure according to state-specific standards. In all cases, however, training and licensure are more rigorous than for general transcription and are specific to court reporting.

Audio transcriptions are used by professionals in a wide range of fields but their application in all domains revolves around note taking, sharing information, documentation, improving accessibility, and facilitating search engine optimization (SEO).

Note Taking and Sharing

Professionals are usually most familiar with audio transcription because of its note taking and sharing applications. People can take notes via dictations and these dictations can later be transcribed. Other audio files can also be converted into text for easier sharing with co-workers. Dictation transcription and audio transcription sharing are commonly used in the following fields:

  • Corporate - Busy managers and executives can easily dictate notes following internal or external client meetings, inspections, conferences, trade shows, and networking events. Once transcribed, these notes can be more easily referenced or shared via a link or attachment to keep colleagues up to date.
  • Academia - Professors, doctoral students, and others involved in academic research can also use transcriptions to improve their workflow. Recordings of research interviews, group discussions, lectures, or dictated notes can be transcribed to assist with data-gathering, writing, and publishing research.
  • Market Research and User Experience - Marketing researchers and UX designers gather a large volume of audio and visual content from phone, web-based, and in-person interviews and focus groups. Audio to text transcriptions facilitate note taking, reviewing, and sharing findings with colleagues.
  • Journalism - Journalists likewise accumulate extensive audio and visual content from interviews, news conferences, speeches, lectures, and other source material. When these are transcribed, journalists can more easily reference them as they craft their reports, stories, and opinion pieces.
  • Medical - Dictation transcription helps doctors and medical professionals efficiently record, share, and reference patient notes. Patient notes serve both as a record of observations and treatments and as a means to share this information with colleagues.
  • Legal - Attorneys, legal and insurance professionals can use dictated notes, recordings of interviews or statements, hearings, and other audio files to take notes on cases and to share evidence and arguments with colleagues.
  • Government - Politicians and public service professionals may create and/or reference audio transcriptions of cabinet minutes, council meetings, public reports, government research, police interviews, and more.

Documentation and Accessibility

Text files have a much smaller file size than audio and video files and are an excellent way of storing evidence, keeping archives of research, and documenting events or information. Many of the fields listed above also use audio transcriptions because they make documentation more accessible. Examples of using transcriptions for documentation include:

  • Lawyers creating memorandums or archiving client interviews
  • Corporate executives documenting expenses
  • Researchers archiving the outcomes of experiments
  • Doctors documenting treatment plans
  • Journalists archiving source materials
  • Government or business personnel dictating meeting minutes

Audio transcriptions can also improve accessibility in other ways. For example, a voicemail transcription can be quickly scanned for relative urgency during a meeting or presentation in a much less disruptive way than listening to a voicemail or leaving to take a call.

Transcriptions also improve searchability for individual and internal use. For instance, an attorney with transcriptions of multiple interviews, meetings, and phone calls could search her archives for particular words or phrases when building a case. Or a marketing researcher looking for novel insights could search his interview transcriptions for a particular client and note the frequency with which respondents used certain words or phrases.
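
To make the searchability point concrete, here is a small Python sketch that counts how often a phrase appears across a set of transcripts. The transcript names and text are invented for illustration, and a real archive would more likely be read from files.

```python
import re


def count_phrase(transcripts: dict[str, str], phrase: str) -> dict[str, int]:
    """Count case-insensitive, whole-word occurrences of a phrase per transcript."""
    pattern = re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
    return {name: len(pattern.findall(text)) for name, text in transcripts.items()}


# Hypothetical interview transcripts keyed by a made-up file name.
transcripts = {
    "interview_01.txt": "The app is easy to use. Easy to use, but slow to load.",
    "interview_02.txt": "I would not call it easy to use on a phone.",
    "interview_03.txt": "Pricing was my main concern.",
}

counts = count_phrase(transcripts, "easy to use")
for name, hits in counts.items():
    print(f"{name}: {hits}")
```

The same search is impractical on raw audio, which is why transcribing recordings first makes this kind of analysis possible.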

SEO Benefits

Audio transcriptions play a major role in search engine optimization because search engines primarily index text rather than the spoken content of audio or video files. By transforming this content into text, transcription expands its searchability. For instance, podcast transcriptions can help people locate specific podcast episodes that are relevant to their searches. Business leaders and academics can transcribe conference presentations to increase exposure to their findings. Webinars, vlogs, speeches, and how-to videos are just some of the other source materials that gain SEO benefits by being transcribed.

Audio Recording Strategies for Transcriptions

One of the key determinants of whether or not transcripts are accurate and useful is the type and quality of the audio files being used. There are two general types of audio formats: hard copy and soft copy. Hard copy media includes CDs, DVDs, cassette tapes, and any recording that is preserved in a physically handled format. Hard copy media tends to slow down transcription turnaround time because it must be first converted to soft copy for faster and more accurate transcription.

Soft copy is the preferred audio format for transcription for a number of reasons. First, automated transcriptions generated by software and online AI platforms require soft copy files as inputs. Human transcribers prefer soft copy as well because it allows them to manipulate the audio to improve their accuracy. They can slow the speed of recording and adjust the sound for greater clarity in a way that is not possible with a hard copy format. Lastly, soft copy media is more easily and rapidly shared via email or an interface where it can be uploaded and sent.

While .mp3 and .wav files are the more commonly used formats of soft copy media, there is an incredibly wide range of formats that can be transcribed. These include:

.aif .aiff .snd .au .dwd .iff .svx .sam .amr .smp .vce .voc .vox
.flv .pcm .wma .m3u .ogg .m4a .dss .wmv .dvf .mpg .avi .mov .mp4

Advances in technology mean that most people can produce a decent soft copy audio file just using their phones. Many automated web-based AI platforms and human transcription services offer apps that can seamlessly record and transfer audio files to be transcribed. However, there are some important factors to consider first which can easily improve or compromise audio quality and therefore enhance or hurt the end product:

  • Choose the type of digital recorder that matches the needs of the situation. For instance, a phone with a built-in microphone will take a poor recording of a roundtable discussion but could be sufficient for dictating personal notes. Investing in higher quality digital recorders can vastly improve transcript quality in the long run, particularly if they are being frequently used to record conversations of more than two people or larger venue events with more ambient noise.
  • Use external microphones whenever possible for multi-speaker recordings rather than built-in microphones on devices. Keep microphones as close as possible to whoever will be speaking or centrally located for group discussions where each speaker will not have a microphone.
  • Turn off voice activation on digital recorders so that the beginnings of sentences are not clipped when recording restarts after a long pause.
  • Avoid crosstalk as much as possible; remind people that only one person should speak at a time.
  • Avoid background noise that can interfere with audio quality such as shuffling paper, eating snacks, and excessive movement of chairs, footsteps, doors, etc. Choose the quietest location possible.
  • Test recorders and microphones before a crucial project needs to be transcribed. Practice transferring files from devices when transcriptions are not urgent or important so that preventable issues can be resolved in advance.

Audio Transcription Styles and What to Choose

Accuracy is obviously important to all transcription users. However, the type of accuracy each person or entity requires may vary. In some instances, it can be distracting to read, “Well, we, uh, sent the, uhm…the reports and then…” In other cases it is important to capture the speaker’s original hesitation in text and, “Well, we sent the reports and…” would not capture the nuance of the original audio.

These variations can be broken down into three styles of transcription:


Non-Verbatim

Non-verbatim transcripts are polished for easier reading. These transcriptions lightly edit people’s statements to be grammatically correct and omit any filler words speakers may use. For example, “gonna” would be changed to “going to” and non-words such as “uhm” would be omitted. Colloquial phrases and filler words may pass unnoticed when heard, but readers tend to find them frustrating on the page. Transcriptionists do not paraphrase anything said by speakers in non-verbatim transcripts; they simply remove the clutter in the audio to improve readability.

Smart Verbatim

In a smart verbatim transcript, every audible word from the audio file is captured. Interjections such as “oops!” and “shh” are all included. Words such as “gotcha” or “unh-uh” are not formalized to “I understand” and “no” respectively. Furthermore, stutters, false starts, and repeated words are transcribed just as they are said, such as, “Yes, I…I did…I…Yes, I confirmed that.” By capturing everything as it is verbalized, the transcription captures the essence of the audio as it happened in real life.

Pure Verbatim

Pure verbatim transcriptions also capture every audible word delivered by speakers exactly as they said them. Grammatical errors and colloquialisms likewise remain. The key difference with pure verbatim is the inclusion of every utterance or non-word spoken. These include conversational fillers such as eh, ah, uh-huh, unh-uh, uhm, hmm, er, and others. With pure verbatim, people receive a genuine print-copy of the audio they provided exactly as it was heard.
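
The relationship between the last two styles can be sketched in code. The Python snippet below strips conversational fillers from a pure verbatim line while leaving false starts and repetitions in place, which is roughly what separates pure verbatim from smart verbatim as described above. The filler list is a small assumed sample, and a human transcriptionist applies far more judgment than a pattern match can.

```python
import re

# Assumed sample of filler sounds; real style guides differ on what counts.
FILLERS = ("uh", "uhm", "um", "ah", "er", "hmm", "eh")

# A filler as a whole word, plus any comma or ellipsis and spaces right after it.
_FILLER_PATTERN = re.compile(
    r"\b(?:" + "|".join(FILLERS) + r")\b(?:\s*(?:,|\.\.\.|…))?\s*",
    re.IGNORECASE,
)


def strip_fillers(pure_verbatim: str) -> str:
    """Remove filler sounds but keep every real word, including repeats."""
    text = _FILLER_PATTERN.sub("", pure_verbatim)
    # Collapse any doubled spaces left behind by the removals.
    return re.sub(r"\s{2,}", " ", text).strip()


line = "Yes, I… uhm, I did… uh, I confirmed that."
print(strip_fillers(line))
```

Going further to a non-verbatim transcript would also mean removing the false starts and tidying the grammar, which is an editorial task rather than a mechanical one.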

Determining What to Use

Non-verbatim transcripts are the most standard type of transcripts. Most people want audio transcriptions to be direct and easily readable. If the goal is strictly to communicate information, then false starts, filler words, side conversations, and interruptions only distract readers. For instance, a doctor does not need any “ahs” or false starts entered into his patient notes. Or a businesswoman does not need a record of any “uhms” in a transcription of her keynote address. Generally speaking, non-verbatim is a safe, default preference for audio transcription needs.

There are some exceptions, of course, in which users do not want this type of subtle editing. For instance, a market researcher might want smart verbatim transcriptions of interviews or focus groups in order to capture participants’ thought processes without unnecessary filler words. Alternatively, an attorney preparing for a court case may want pure verbatim transcriptions of interviews that include all filler words. These words and the frequency of their use can imply hesitation to share information, lack of understanding, or a poor recollection of events. Likewise, an academic researcher specializing in one of the social sciences may find that it is important to capture every nuance of people’s responses when conducting a study.

In short, the purpose of the audio transcription will usually help determine which style of transcription should be chosen. Having three style options expands the applications of audio transcriptions.

Human Audio Transcription vs. Automated Audio to Text

Once people have settled on using audio transcriptions, they must determine if they will choose a transcription service that uses human transcribers, a service that uses audio to text converters, or if they will opt for automated audio to text conversion to use on their own. Naturally, each solution involves some tradeoffs and people must determine which benefits they want to prioritize.

Human Transcriber Pros

  • Achieves the highest level of accuracy possible in audio transcriptions
  • Greater ability to accurately transcribe audio content that has a mix of accents
  • Greater ability to accurately transcribe audio with a high proportion of industry-specific terminology
  • Greater flexibility with various transcription and audio formats

Human Transcriber Cons

  • May cost more than getting machine-generated transcripts
  • Slightly slower turnaround times
  • Some loss of control over workflow for people who prefer DIY solutions

Automated Transcription Pros

  • Lower long-term costs for downloadable software programs, particularly if high levels of accuracy are not demanded
  • Automated web-based services may offer a cost compromise between cheap downloadable software and human transcribers
  • Keep greater control over own workflow
  • Have the flexibility to either self-edit or add the cost of a transcription editor

Automated Transcription Cons

  • Cannot achieve the level of accuracy that a human transcriber can
  • Inaccurate transcripts may need more editing, which is time-consuming
  • More likely to struggle with accents and jargon-heavy content
  • May be more difficult to convert less common soft copy formats such as .aif or .flv
  • Limited availability in different languages

Determining What to Use

Choosing how to generate audio transcriptions ultimately comes down to what is most important to you. For instance, accuracy is of the utmost importance in legal cases, and a downloadable software program is unlikely to generate an acceptable degree of accuracy. The extra time a lawyer spends editing a machine-generated transcript would be an inefficient use of her resources, so a human transcriber would ultimately be the lower-cost option.

In contrast, a pastor who wants to share and archive his sermons with audio transcriptions may accept a lower level of accuracy because his budget is more constrained. Having spoken the entirety of the audio content himself, he may find that it is easier for him to edit an automated transcription.

Automated web-based platforms and human transcription services both cater to people who value convenience. Both types of services offer apps which can be downloaded and used to record content and then automatically submit it for transcription. However, some people may value the additional peace of mind of having a responsible employee handle their content and provide updates as opposed to an automated system.


Audio transcriptions have been available for a long time, but continued improvements in audio technology and transcriptions’ multi-faceted applications have kept them a staple of professionals in a wide range of fields. They can just as easily meet the needs of an independent podcaster with a small budget as they can an insurance company with high demands for precision. Their far-reaching contributions to efficiency, accessibility, searchability, and documentation mean that they can surely enhance your work too.
