Speakers and the transcript viewer
Once a meeting or an imported file finishes transcribing, InkSpoke shows you a speaker-labeled transcript in the Transcript Viewer — where you can rename speakers, replay the audio, copy the text, and export to several formats. This page explains how the labeling works and everything you can do once the transcript is in front of you.
If you haven't captured anything yet, start with Record a meeting or Transcribe a file.
How speaker labeling works
"Diarization" is just a fancy word for figuring out who spoke when. InkSpoke does this in two ways and picks whichever fits the recording — all on your own device.
- Channel attribution (meetings). A meeting records your microphone and the computer's system audio on two separate tracks. That makes labeling easy: anything from your mic becomes You, and anything coming through the speakers becomes Participants. It's fast, reliable, and needs no extra model.
- On-device ML diarization (single-track audio). An imported file is a single mixed track, so there's no "channel" to split on. Instead, an optional on-device machine-learning model listens to the voice fingerprints and clusters the audio into distinct speakers. This needs a one-time model download (see below).
InkSpoke combines both into a hybrid strategy: it uses the ML diarizer when its models are installed, falls back to channel attribution for two-track meetings, and — if it can't find distinct voices — labels the whole recording as a single Speaker. Turn diarization off and everything is attributed to one Speaker as well.
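The selection order described above can be sketched roughly as follows. This is only an illustration, not InkSpoke's actual code: the function name `pick_labeling`, its parameters, and the returned strategy names are hypothetical, and treating a single-track recording without the ML models as one Speaker is an assumption.

```python
def pick_labeling(diarization_on: bool, ml_models_installed: bool, two_track: bool) -> str:
    """Return a (hypothetical) strategy name for one recording."""
    if not diarization_on:
        # Diarization turned off: everything is attributed to one Speaker.
        return "single_speaker"
    if ml_models_installed:
        # Preferred path: cluster voices with the on-device ML models.
        return "ml_diarization"
    if two_track:
        # Meetings record mic and system audio separately: You / Participants.
        return "channel_attribution"
    # Assumption: one mixed track and no ML models leaves nothing to split on.
    return "single_speaker"


if __name__ == "__main__":
    print(pick_labeling(diarization_on=True, ml_models_installed=False, two_track=True))
```

In this sketch, the case where the ML diarizer runs but finds no distinct voices is not modeled; per the text above, that also ends with a single Speaker label.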
If you import a file with a diarization-capable cloud model, the speaker splitting happens on the provider's side and the speaker count is auto-detected. Live meetings always transcribe and diarize on-device; see On-device vs. cloud.
Download the diarization model
The ML diarizer relies on two small on-device models that aren't bundled by default. Download them once, and speaker identification runs entirely offline from then on.
- Open Settings → Recordings & Meetings.
- In the Speakers section, click Download diarization model.
- Watch the progress bar. When it's done, you'll see "Speaker identification runs fully on-device."
| Model | Purpose | Size |
|---|---|---|
| pyannote segmentation | Finds where speech turns start and end | ~6.96 MB |
| 3D-Speaker (ERes2Net) embedding | Turns each voice into a fingerprint for clustering | ~39.6 MB |
| Total | | ~46.5 MB |
Because a meeting already separates you from everyone else by channel, you'll get You / Participants labels even without the ML model. Download it when you want finer per-person splitting — most useful for imported files, where there's only one track to work from.
Speaker count. When you transcribe a file, you can set the number of speakers (Auto-detect, or 1–8) for the local diarizer. Live meetings always auto-detect, and so does cloud diarization.
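As a rough illustration of the allowed values, the speaker-count choice could be validated like this. The function name `parse_speaker_count` and the use of `None` to mean auto-detect are hypothetical; only the Auto-detect option and the 1–8 range come from the text above.

```python
from typing import Optional


def parse_speaker_count(choice: str) -> Optional[int]:
    """Return None for auto-detect, or an int from 1 to 8."""
    if choice.strip().lower() in ("auto", "auto-detect"):
        return None
    count = int(choice)  # raises ValueError for non-numeric input
    if not 1 <= count <= 8:
        raise ValueError(f"speaker count must be 1-8 or Auto-detect, got {count}")
    return count
```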
The Transcript Viewer
The Transcript Viewer opens automatically when a meeting or import finishes. You can reopen any past transcript from Settings → History (or the tray History… item) by clicking its row — see History and diagnostics.
┌────────────────────────────────────────────────────────────┐
│ Weekly sync 0:42:17 · English │
│ [ ▶ Play ] │
├────────────────────────────────────────────────────────────┤
│ Speakers │
│ [ You ] [ Participants ] │
├────────────────────────────────────────────────────────────┤
│ You Morning — can everyone hear me okay? │
│ Participants Yep, loud and clear. │
│ You Great, let's start with the roadmap. │
│ Participants I had one question about the timeline… │
│ ⋮ │
├────────────────────────────────────────────────────────────┤
│ [ Copy ] [ Export VTT ] [ Other formats ▾ ] │
└────────────────────────────────────────────────────────────┘
The viewer has four parts:
- Header — the recording's title, its duration (h:mm:ss), and the detected language. A Play / Stop button appears here only when the audio was kept (see Keeping and pruning recordings below); it toggles playback of the saved recording.
- Speaker roster — an editable list of every speaker label. Type a new name into any box (for example, rename Participants to Priya) and it updates all of that speaker's segments live and saves automatically.
- Segment list — the transcript itself, one line per turn: a speaker label plus what they said, scrollable top to bottom.
- Export footer — Copy, Export VTT, and an Other formats ▾ menu (covered next).
Renaming a speaker isn't cosmetic — it rewrites that label everywhere in the transcript and persists to the saved recording, so your exports and future re-opens carry the real names.
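Conceptually, a rename is a relabel across every segment that carries the old name. The sketch below shows that idea with a hypothetical `Segment` type and `rename_speaker` helper; InkSpoke's real data model and how it persists the change are not documented here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    speaker: str
    text: str


def rename_speaker(segments: List[Segment], old: str, new: str) -> int:
    """Relabel every segment spoken by `old`; return how many were changed."""
    changed = 0
    for seg in segments:
        if seg.speaker == old:
            seg.speaker = new
            changed += 1
    return changed


if __name__ == "__main__":
    transcript = [
        Segment("You", "Morning, can everyone hear me okay?"),
        Segment("Participants", "Yep, loud and clear."),
        Segment("Participants", "I had one question about the timeline."),
    ]
    rename_speaker(transcript, "Participants", "Priya")
    for seg in transcript:
        print(f"{seg.speaker}: {seg.text}")
```

Because the label is changed on the segments themselves, anything rendered from them afterwards (including the plain-text Speaker: text form) picks up the new name.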
Export and copy
From the footer you can copy the transcript or save it in any of five formats. Copy puts a plain-text version (Speaker: text lines) on your clipboard. Export VTT is the default button; the rest live under Other formats ▾. The export filename is derived from the recording's title.
| Format | Extension | Timing | Best for |
|---|---|---|---|
| WebVTT (default) | .vtt | HH:MM:SS.mmm | Web video captions |
| SubRip | .srt | HH:MM:SS,mmm | Standard subtitle files for video players |
| Plain text | .txt | — | Quick paste, Speaker: text lines |
| Markdown | .md | Clock timestamps | Notes and docs, with a title and bold speaker names |
| JSON | .json | Start/end per segment | Feeding another tool: ids, speakers (with origin), and per-segment timestamps, confidence, and overlap flags |
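The two subtitle timing styles in the table differ only in the separator before the milliseconds: a period for WebVTT and a comma for SubRip. A minimal sketch of that formatting, using a hypothetical helper name (`format_timestamp`) rather than anything from InkSpoke itself:

```python
def format_timestamp(seconds: float, decimal_mark: str = ".") -> str:
    """Format seconds as HH:MM:SS<mark>mmm ('.' for WebVTT, ',' for SubRip)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}{decimal_mark}{ms:03d}"


def cue_timing(start: float, end: float, decimal_mark: str = ".") -> str:
    """Build the 'start --> end' timing line used by both formats."""
    return f"{format_timestamp(start, decimal_mark)} --> {format_timestamp(end, decimal_mark)}"


if __name__ == "__main__":
    print(cue_timing(0.0, 2.5))        # WebVTT style
    print(cue_timing(0.0, 2.5, ","))   # SubRip style
```

How InkSpoke embeds speaker names inside each cue is not specified on this page, so the sketch stops at the timing line.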
For dropping notes into a doc or wiki, choose Markdown. For adding captions to a recording of the call, choose SubRip (.srt) or WebVTT (.vtt). For piping into your own script or spreadsheet, choose JSON: it's the only format that carries confidence, overlap flags, and speaker origin alongside per-segment start and end times.
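As an example of the kind of script a JSON export could feed, the sketch below totals talk time per speaker. The field names (`segments`, `speaker`, `start`, `end`, `text`) and the inline sample are hypothetical; check them against a real export before relying on this, since the exact schema is not documented on this page.

```python
import json
from collections import defaultdict

# Hypothetical sample shaped like a per-segment export; not real InkSpoke output.
SAMPLE = """
{"segments": [
  {"speaker": "You", "start": 0.0, "end": 2.5, "text": "Morning"},
  {"speaker": "Participants", "start": 2.5, "end": 4.0, "text": "Yep"},
  {"speaker": "You", "start": 4.0, "end": 7.0, "text": "Great"}
]}
"""


def talk_time(export_json: str) -> dict:
    """Sum (end - start) in seconds for each speaker label."""
    data = json.loads(export_json)
    totals = defaultdict(float)
    for seg in data["segments"]:
        totals[seg["speaker"]] += seg["end"] - seg["start"]
    return dict(totals)


if __name__ == "__main__":
    for speaker, seconds in talk_time(SAMPLE).items():
        print(f"{speaker}: {seconds:.1f} s")
```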
Keeping and pruning recordings
Whether you can replay audio in the viewer — and how long transcripts stick around — comes down to two settings under Settings → Recordings & Meetings → Storage.
| Setting | Default | What it does |
|---|---|---|
| Keep recorded audio | Off | Saves the mixed meeting audio alongside the transcript so you can replay it (this is what makes the Play button appear). InkSpoke writes the audio before transcribing, so even a failed transcription keeps the recording. |
| Keep recordings for | Keep forever | Auto-deletes recordings older than the chosen window. Audio files are moved to your system trash. Applied at startup and immediately when a new recording is saved. |
The Keep recordings for picker offers: Keep forever, 1 week, 1 month, 3 months, 6 months, and 1 year.
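The pruning rule can be pictured as a cutoff date: anything saved before "now minus the window" is eligible for deletion. The sketch below is an assumption-laden illustration, not InkSpoke's implementation: the names `RETENTION_WINDOWS` and `expired` are hypothetical, and a month is approximated as 30 days and a year as 365 days.

```python
from datetime import datetime, timedelta
from typing import List, Optional, Tuple

# Picker label -> window length. Month and year lengths are assumptions.
RETENTION_WINDOWS = {
    "Keep forever": None,
    "1 week": timedelta(weeks=1),
    "1 month": timedelta(days=30),
    "3 months": timedelta(days=90),
    "6 months": timedelta(days=180),
    "1 year": timedelta(days=365),
}


def expired(recordings: List[Tuple[str, datetime]], choice: str, now: datetime) -> List[str]:
    """Return names of recordings saved before the cutoff for `choice`."""
    window: Optional[timedelta] = RETENTION_WINDOWS[choice]
    if window is None:
        return []  # "Keep forever" never prunes anything
    cutoff = now - window
    return [name for name, saved_at in recordings if saved_at < cutoff]


if __name__ == "__main__":
    now = datetime(2024, 6, 1)
    recs = [("standup", datetime(2024, 5, 30)), ("kickoff", datetime(2024, 4, 1))]
    print(expired(recs, "1 month", now))
```

Per the table above, a check like this would run at startup and again whenever a new recording is saved.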
File imports always remember the path to your original source file, regardless of the Keep-audio setting — InkSpoke doesn't copy the media, it just points back to it. If you keep no audio for a meeting, only the transcript is stored and the Play button won't show.
Re-running a transcript
Sometimes a transcript comes out wrong — the wrong language was detected, or a meeting failed to process. A failed meeting is saved as a Failed row in History rather than disappearing.
To get a fresh transcript, re-import the source audio: open Transcribe a file and run it again, this time picking the right language, model, or speaker count. Because file imports remember the original file, re-importing is the reliable path.
There's no confirmed in-app "re-transcribe this meeting" button in the current build. If you want to re-run a meeting, re-import its audio through Transcribe a file. This is why turning on Keep recorded audio before an important call is worth it: without a saved file there's nothing to re-run.
Platform notes
Speaker identification runs fully on-device on Windows, macOS, and Linux once the diarization model is downloaded — nothing about who-said-what is sent anywhere. The Transcript Viewer, renaming, playback, and every export format behave identically across all three platforms. (Where the underlying audio comes from — system-audio capture permissions and the like — is covered in Record a meeting.)
Next steps
- Record a meeting — capture your mic plus everyone else's audio.
- Transcribe a file — turn an existing audio or video file into a transcript, and re-run one.
- History and diagnostics — find, filter, and reopen every saved transcript.
- On-device vs. cloud and privacy — where your audio and text are processed.