Transcribe a file
Already have a recording — an interview, a lecture, a podcast, a voice memo, a downloaded meeting? InkSpoke can transcribe it for you and produce the same speaker-labeled transcript you get from a live meeting, without you having to play it back in real time.
Under the hood, file import runs the exact same long-form engine as meeting recording. The only difference is where the audio comes from: instead of your mic and system audio, it's a file you already have on disk.
When to use it
Use Transcribe a file whenever the audio already exists and you just want the text:
- Turn a recorded interview or call into a transcript.
- Get captions (.srt/.vtt) for a video.
- Transcribe a lecture, webinar, or podcast episode.
- Re-process a file with different settings — a different language, model, or speaker count.
If you want InkSpoke to capture a live meeting from your microphone and the other participants' audio, use Record a meeting instead.
Open the dialog
There are two ways in — both open the same Transcribe a file window:
- From the tray menu: click the InkSpoke system-tray icon and choose Transcribe a File…
- From History: open Settings → History (or the tray History… item) and use the Transcribe a File action there.
There is no dedicated hotkey for file import — it's always started from the tray menu or History.
┌────────────────────────────────────────────────┐
│ Transcribe a file                            ✕ │
├────────────────────────────────────────────────┤
│ File: [ Choose file… ]  interview.m4a          │
│                                                │
│ [ Workspace ▾ ]  [ Language ▾ ]                │
│                                                │
│ Transcription model                            │
│ [ Whisper Small · on-device ▾ ]                │
│                                                │
│ [x] Identify speakers (diarization)            │
│     Speakers  [ Auto-detect ▾ ]                │
│                                                │
│ ▓▓▓▓▓▓▓▓▓░░░░░░░  Transcribing… 42%            │
│                                                │
│                     [ Cancel ]  [ Transcribe ] │
└────────────────────────────────────────────────┘
Step by step
- Choose a file. Click Choose file… and pick your audio or video file (see supported formats below).
- Pick a workspace. The workspace you choose applies its vocabulary and dictionary substitutions to the transcript, so domain terms and names come out spelled the way you want. Leave it on your default if you're not sure.
- Set the language. Choose the spoken language, or leave it on Auto to let the model detect it.
- Pick a transcription model. Defaults to your active on-device Whisper model. You can switch to a larger on-device model or, if you've configured one, a cloud diarized model (see Local vs. cloud).
- Choose whether to identify speakers. The Identify speakers (diarization) checkbox is on by default. Leave it on to get a speaker-labeled transcript; turn it off to treat the whole recording as a single "Speaker".
- (Optional) Set the speaker count. When diarization is on and you're using an on-device model, you can leave Speakers on Auto-detect or pin it to a fixed number from 1 to 8 if you already know how many people are talking.
- Click Transcribe. A progress bar and status text track the work. Cancel aborts at any time.
When it finishes, the transcript opens automatically in the Transcript Viewer, where you can rename speakers, replay audio (if kept), copy the text, or export it. See Speakers and the transcript viewer.
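If you are curious what a caption export contains, the sketch below builds SubRip (.srt) text from a list of timed, speaker-labeled segments. It is only an illustration of the SRT layout: the `(start, end, speaker, text)` tuple shape and the `to_srt` helper are hypothetical, and whether InkSpoke's own export prefixes each caption with the speaker name is an assumption.

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time offset in seconds as the SRT timestamp HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments) -> str:
    """Build SRT text from (start_s, end_s, speaker, text) tuples.

    The tuple shape is a hypothetical stand-in for a transcript, not
    InkSpoke's actual data model.
    """
    blocks = []
    for index, (start, end, speaker, text) in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n"
            f"{srt_timestamp(start)} --> {srt_timestamp(end)}\n"
            f"{speaker}: {text}\n"
        )
    # A blank line separates consecutive caption blocks.
    return "\n".join(blocks)
```

Each caption block is a running number, a time range, and the caption text, with a blank line between blocks.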
What happens after you click Transcribe
InkSpoke first probes the file for a decodable audio stream, then uses a bundled FFmpeg to decode it to 16 kHz mono audio, the format the transcription engine expects. Because FFmpeg is bundled with InkSpoke, the common formats below work without you installing anything.
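For readers who want to see what a probe-then-decode step looks like with stock FFmpeg tools, here is a minimal sketch that builds the two command lines. The flags are standard `ffprobe`/`ffmpeg` options, but the exact arguments InkSpoke passes are not documented here, so treat the choice of 16-bit PCM WAV output and the helper names `probe_command` and `decode_command` as assumptions.

```python
from pathlib import Path


def probe_command(source: Path) -> list[str]:
    """ffprobe arguments that print the codec of each audio stream, one per line.

    Empty output means the file has no decodable audio stream.
    """
    return [
        "ffprobe", "-v", "error",
        "-select_streams", "a",
        "-show_entries", "stream=codec_name",
        "-of", "csv=p=0",
        str(source),
    ]


def decode_command(source: Path, target: Path) -> list[str]:
    """ffmpeg arguments that decode `source` to 16 kHz mono audio.

    Writing 16-bit PCM WAV is an assumption made for this sketch.
    """
    return [
        "ffmpeg", "-y",
        "-i", str(source),
        "-vn",                 # ignore any video stream
        "-ac", "1",            # one channel (mono)
        "-ar", "16000",        # 16 kHz sample rate
        "-c:a", "pcm_s16le",   # 16-bit PCM
        str(target),
    ]
```

Passing either list to `subprocess.run(..., check=True)` would run the step on a machine where the tools are installed.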
Supported formats
InkSpoke can read a wide range of containers. For video files, it extracts and transcribes the audio track.
| Type | Formats |
|---|---|
| Audio | .mp3 · .wav · .m4a · .ogg · .opus · .flac · .aac · .wma |
| Video (audio track) | .mp4 · .webm · .avi · .mkv · .mov |
File import needs the bundled FFmpeg/FFprobe helper, which ships with InkSpoke. If your build can't find it and it isn't on your system PATH, decoding will fail — reinstall to restore the bundled copy.
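If you want to pre-check a file before importing it, the sketch below classifies a path against the extensions in the table above and looks for `ffmpeg` and `ffprobe` on the system PATH. It runs outside InkSpoke and does not inspect the bundled copy; the function names are hypothetical, and an extension match does not guarantee the file actually contains a decodable audio stream.

```python
import shutil
from pathlib import Path
from typing import Dict, Optional

# Extensions copied from the supported-formats table.
AUDIO_EXTENSIONS = {".mp3", ".wav", ".m4a", ".ogg", ".opus", ".flac", ".aac", ".wma"}
VIDEO_EXTENSIONS = {".mp4", ".webm", ".avi", ".mkv", ".mov"}


def classify(path: str) -> str:
    """Return "audio", "video", or "unsupported" based on the file extension."""
    suffix = Path(path).suffix.lower()
    if suffix in AUDIO_EXTENSIONS:
        return "audio"
    if suffix in VIDEO_EXTENSIONS:
        return "video"
    return "unsupported"


def helpers_on_path() -> Dict[str, Optional[str]]:
    """Locate ffmpeg and ffprobe on the system PATH (None when not found)."""
    return {name: shutil.which(name) for name in ("ffmpeg", "ffprobe")}
```

A `None` value from `helpers_on_path()` only means there is no system-wide copy to fall back on; the bundled helper is still what InkSpoke uses first.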
The dialog options
| Option | Choices | What it does |
|---|---|---|
| File | Any supported file | The audio/video to transcribe. Your original file is never modified or moved — InkSpoke just reads it. |
| Workspace | Any of your workspaces | Applies that workspace's vocabulary and personal-dictionary substitutions to the result. |
| Language | Auto, or a specific language | The spoken language. Auto lets the model detect it. |
| Transcription model | On-device Whisper sizes; a cloud diarized option when configured | Which engine transcribes. Defaults to your active on-device Whisper model. |
| Identify speakers (diarization) | On (default) / Off | Labels who spoke when. Off collapses everything into one "Speaker". |
| Speakers | Auto-detect, or 1–8 | For on-device diarization only: fix the speaker count or let InkSpoke detect it. Hidden when you pick a cloud model (it auto-detects). |
These pickers apply to this one import only — nothing here is saved as a default. Global recording defaults live in Settings → Recordings & Meetings.
Local vs. cloud engines
The Transcription model picker decides where the work runs.
| | On-device (Whisper) | Cloud (diarized model) |
|---|---|---|
| Where it runs | Fully on your machine | Uploaded to a cloud provider |
| Privacy | Audio never leaves your device | Compressed audio is sent to the provider |
| Network | Works offline | Requires an internet connection |
| Speaker count | Auto-detect, or fix it 1–8 | Auto-detected by the provider |
| Length limit | Handles very long files | ~25 MB of compressed upload |
| When it appears | Always (the default) | Only when a diarize-capable cloud model is configured |
On-device is the default and needs no setup — it chunks the audio through Whisper (with your workspace vocabulary applied) and labels speakers locally. If you pick a Whisper size you haven't downloaded yet, InkSpoke fetches it on demand before starting.
Cloud appears in the model list as an option ending in "· auto speakers (cloud)", and only when you've configured a diarize-capable cloud model — either your own key (BYOK) or the InkSpoke Platform. It compresses the mixed audio and sends it to a diarized-transcription model, which returns the transcript, speaker labels, and timestamps with the speaker count detected automatically.
The cloud engine accepts up to about 25 MB of compressed audio per file. Recordings that compress to more than that are rejected with a message asking you to use a local model instead. For a multi-hour interview or lecture, choose an on-device Whisper model, which has no such limit.
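To get a rough feel for how long a recording can be before it hits the upload cap, you can estimate the compressed size as duration times bitrate. The sketch below does that arithmetic. The 32 kbps default is purely an assumed bitrate (the bitrate InkSpoke compresses to is not stated here), and the cap is treated as 25,000,000 bytes, which is also an assumption since the limit is only described as "about 25 MB".

```python
# "About 25 MB"; the exact byte threshold is an assumption.
CLOUD_LIMIT_BYTES = 25 * 1000 * 1000


def estimated_upload_bytes(duration_seconds: float, bitrate_kbps: float) -> float:
    """Approximate compressed size: seconds x bits per second / 8."""
    return duration_seconds * bitrate_kbps * 1000 / 8


def suggest_engine(duration_seconds: float, bitrate_kbps: float = 32.0) -> str:
    """Suggest an engine from the estimated upload size.

    The default bitrate is an assumed value used only for illustration.
    """
    if estimated_upload_bytes(duration_seconds, bitrate_kbps) <= CLOUD_LIMIT_BYTES:
        return "cloud or on-device"
    return "on-device"
```

Under those assumptions a 30-minute recording is about 7.2 MB and fits, while a 3-hour recording is about 43.2 MB and would need an on-device model.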
On-device speaker identification uses two small ML models (about 46.5 MB total) that download on demand the first time you use diarization. You can also fetch them ahead of time from Settings → Recordings & Meetings → Speakers → Download diarization model. Once downloaded, speaker identification runs fully on-device. If diarization is off, the whole file becomes a single "Speaker".
Re-transcribing a file
Not happy with the result — wrong language detected, too few speakers, or you want to try the cloud engine? Just run the import again: open Transcribe a File…, pick the same file, and choose different settings. Each run produces its own transcript, so you can compare. Imported recordings keep a reference to their original source file, so re-importing is quick.
Platform notes
File import behaves the same on Windows, macOS, and Linux — same dialog, same formats, same FFmpeg decoding. The one practical difference is transcription speed, which depends on hardware acceleration:
| Platform | On-device acceleration |
|---|---|
| macOS | CoreML |
| Windows | CUDA (NVIDIA GPUs) |
| Linux | CPU |
Cloud transcription speed is the same everywhere since the heavy lifting happens on the provider's servers.
Next steps
- Record a meeting — capture a live meeting from your mic plus the other participants' audio.
- Speakers and the transcript viewer — rename speakers, replay audio, and export to txt, md, srt, vtt, or json.
- Models and providers — download Whisper sizes and set up a cloud transcription model.
- On-device vs. cloud and privacy — understand where your audio is processed.