# Local Audio Transcription with MLX Whisper and an AI Agent API on Apple Silicon
I attend a lot of meetings. Some are in-person, some remote, but almost all of them benefit from having a transcript and summary afterwards. Commercial transcription services work well, but they come with two drawbacks: they cost money per minute, and your audio gets sent to external servers.
So I built a local-first workflow that runs entirely on my MacBook. It uses MLX Whisper (optimized for Apple Silicon) for transcription and, optionally, your preferred AI agent API (I used the Claude API) for generating summaries and action points. The transcription itself is completely free and offline. The AI step is optional and costs a few cents per hour of audio.
## The workflow
The pipeline is straightforward:
```
Audio file (.m4a, .mp3, .wav)
        |
        v
MLX Whisper (local, Apple Silicon)   -- 1-3 min per hour of audio
        |
        v
Transcript (.md)
        |
        v
AI agent API (optional)              -- ~$0.05 per hour of audio
        |
        +---> Summary
        +---> Action points
        +---> Full report
        +---> Visual diagrams (Mermaid)
```
Everything runs from a single Python script with interactive prompts. No configuration files needed.
## Why MLX Whisper?
OpenAI’s Whisper is the standard open-source speech recognition model. MLX Whisper is a port that runs natively on Apple Silicon using Apple’s MLX framework. The difference in speed is significant: on an M1 MacBook Pro, a one-hour recording transcribes in about 2 minutes with the medium model. No GPU required, no cloud, no cost.
The model options range from tiny (fast, rough) to large (slower, very accurate):
| Model | Speed | Quality | Best for |
|---|---|---|---|
| tiny | Very fast | Basic | Quick checks, clear speech |
| base | Fast | Decent | Simple recordings |
| small | Fairly fast | Good | Most recordings |
| medium | Moderate | Very good | Recommended default |
| large | Slower | Excellent | Difficult audio, accents |
For Dutch audio, the medium model hits the sweet spot between speed and accuracy. For English, small often suffices.
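In code, the transcription step is a single call. Here is a minimal sketch using the `mlx_whisper` package; the Hugging Face repo names follow the `mlx-community` naming convention, but verify the exact checkpoint names on the Hub before relying on them:

```python
def repo_for(model: str) -> str:
    # Map a Whisper model size to an MLX checkpoint on the Hugging Face Hub.
    # The naming scheme is an assumption; check mlx-community for exact repos.
    return f"mlx-community/whisper-{model}-mlx"

def transcribe(audio_path: str, model: str = "medium", language: str = "nl") -> str:
    # pip install mlx-whisper; imported lazily so this module loads anywhere.
    import mlx_whisper
    result = mlx_whisper.transcribe(
        audio_path,
        path_or_hf_repo=repo_for(model),
        language=language,  # omit to let Whisper auto-detect the language
    )
    return result["text"]
```

The result dictionary also contains per-segment timestamps, which are handy if you want to link summary points back to moments in the recording.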
## Setup
The setup requires three components:
```bash
# MLX Whisper for transcription
pip install mlx-whisper

# Anthropic SDK for Claude summaries (optional; swap in your preferred AI service)
pip install anthropic

# FFmpeg for audio processing
brew install ffmpeg
```
If you want AI-powered summaries, set your API key:

```bash
export ANTHROPIC_API_KEY='your-key-here'
```
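With the key set, the summarization step boils down to one Messages API call. A minimal sketch, where the prompt wording and model name are placeholders to adapt to your needs:

```python
SUMMARY_PROMPT = (
    "Summarize the following meeting transcript in concise bullet points, "
    "then list any action points with owners if mentioned.\n\n{transcript}"
)

def build_prompt(transcript: str) -> str:
    # Pure helper so the prompt can be inspected without an API key.
    return SUMMARY_PROMPT.format(transcript=transcript)

def summarize(transcript: str, model: str = "claude-sonnet-4-20250514") -> str:
    # Requires `pip install anthropic` and ANTHROPIC_API_KEY in the environment.
    import anthropic
    client = anthropic.Anthropic()  # picks up ANTHROPIC_API_KEY automatically
    message = client.messages.create(
        model=model,
        max_tokens=1024,
        messages=[{"role": "user", "content": build_prompt(transcript)}],
    )
    return message.content[0].text
```

Keeping the prompt in one place makes it easy to tweak the output style (language, level of detail) for all runs at once.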
## How the script works
Running the script presents a series of interactive prompts:
```bash
python transcribe_and_summarize.py

# Step 1: Choose language (Dutch, English, German, French, Japanese)
# Step 2: Choose Whisper model (tiny through large)
# Step 3: Choose what to generate:
#   - Summary only
#   - Action points only
#   - Summary + action points
#   - Full report
#   - Everything (report + summary + actions + visual diagrams)
#   - Nothing (transcription only)
# Step 4: Point to the audio file
# Step 5: Set the output directory
# Step 6: Wait 2-4 minutes
# Step 7: Done
```
The output is a set of markdown files, one per output type. They work well in any markdown editor, and I drop them straight into my Obsidian vault for reference.
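The choice handling in Step 3 can be sketched as a simple mapping from menu selection to a set of outputs (the names here are illustrative, not the script's actual identifiers):

```python
OUTPUT_CHOICES = {
    "1": {"summary"},
    "2": {"actions"},
    "3": {"summary", "actions"},
    "4": {"report"},
    "5": {"report", "summary", "actions", "diagrams"},
    "6": set(),  # transcription only
}

def parse_choice(choice: str) -> set:
    # Map a menu selection to the outputs to generate; reject unknown input.
    try:
        return OUTPUT_CHOICES[choice.strip()]
    except KeyError:
        raise ValueError(f"unknown choice: {choice!r}")
```

One markdown file is then written per entry in the returned set, which is why each output lands in its own file.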
## Cost comparison
| Component | Cost | Processing time |
|---|---|---|
| Transcription (MLX Whisper) | Free | 1-3 min per hour of audio |
| Summaries (Claude API) | ~$0.03-0.10 per output | 10-30 sec per output |
| Total per hour of audio | ~$0.05-0.15 | 2-4 minutes |
Compare that to commercial services that charge $0.50-2.00 per hour for transcription alone. And your audio stays on your machine.
## The privacy option
If you don’t want any data leaving your machine, you can skip the external AI tool step entirely and run the summarization locally with Ollama:
```bash
# Transcribe locally (MLX Whisper)
python transcriber_mlx.py

# Summarize locally (Ollama)
python summarize_transcript_local.py
```
The quality of local summaries depends on your hardware and the model you run, but for meeting notes it’s often good enough. The entire workflow is then 100% offline, 100% free, and 100% private.
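Under the hood, the local summarization step is just a POST to Ollama's REST API on localhost. A stdlib-only sketch, where the model name is whatever you have pulled (e.g. via `ollama pull llama3.1`) and the prompt wording is illustrative:

```python
import json
import urllib.request

def build_payload(transcript: str, model: str = "llama3.1") -> dict:
    # Pure helper: the request body for Ollama's /api/generate endpoint.
    return {
        "model": model,
        "prompt": "Summarize this meeting transcript as bullet points:\n\n" + transcript,
        "stream": False,  # one JSON response instead of a token stream
    }

def summarize_local(transcript: str, model: str = "llama3.1") -> str:
    # Talks only to localhost:11434 -- nothing leaves the machine.
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps(build_payload(transcript, model)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["response"]
```

Because the request never leaves localhost, this path has the same privacy profile as the transcription step.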
## Practical tips
**For best quality:** Use the large model. It takes longer but handles accents, background noise, and overlapping speakers much better.

**For speed:** Use the small model with summary-only output. Total processing time drops to about 2 minutes for a one-hour recording.

**For batch processing:** Transcribe all recordings first, then summarize separately. This lets you review transcripts before spending API credits on summaries.

**For recurring meetings:** Create a dedicated output folder per project or meeting series. Over time you build a searchable archive of everything discussed.
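A batch run mostly needs a stable file listing; a small sketch (the extensions match the formats mentioned earlier, and the function name is my own):

```python
from pathlib import Path

AUDIO_EXTENSIONS = {".m4a", ".mp3", ".wav"}

def find_recordings(folder: str) -> list:
    # Sorted so repeated runs process the same files in the same order.
    return sorted(
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() in AUDIO_EXTENSIONS
    )
```

Loop over the result to transcribe everything in one pass, then summarize in a second pass once you have reviewed the transcripts.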
## Why this matters
The real value isn’t the transcription itself. It’s what happens when every meeting, interview, and brainstorming session becomes searchable text. You can grep through months of meetings. You can ask your CLI AI tool to find all action points from a project’s history. You can settle the “did we discuss this?” question in seconds.
And because it runs locally, there’s no friction. Record the meeting, run the script, move on. The 15-minute setup pays for itself after the first use.