# Local Audio Transcription with MLX Whisper and an AI Agent API on Apple Silicon
I attend a lot of meetings. Some are in-person, some remote, but almost all of them benefit from having a transcript and summary afterwards. Commercial transcription services work well, but they come with two drawbacks: they cost money per minute, and your audio gets sent to external servers.
So I built a local-first workflow that runs entirely on my MacBook. It uses MLX Whisper (optimized for Apple Silicon) for transcription and, optionally, your preferred AI agent API (I used the Claude API) for generating summaries and action points. The transcription itself is completely free and offline. The AI step is optional and costs a few cents per hour of audio.
## The workflow
The pipeline is straightforward:
```
Audio file (.m4a, .mp3, .wav)
        |
        v
MLX Whisper (local, Apple Silicon)   -- 1-3 min per hour of audio
        |
        v
Transcript (.md)
        |
        v
AI agent API (optional)              -- ~$0.05 per hour of audio
        |
        +---> Summary
        +---> Action points
        +---> Full report
        +---> Visual diagrams (Mermaid)
```
Everything runs from a single Python script with interactive prompts. No configuration files needed.
## Why MLX Whisper?
OpenAI’s Whisper is the standard open-source speech recognition model. MLX Whisper is a port that runs natively on Apple Silicon using Apple’s MLX framework. The difference in speed is significant: on an M1 MacBook Pro, a one-hour recording transcribes in about 2 minutes with the medium model. No GPU required, no cloud, no cost.
The model options range from tiny (fast, rough) to large (slower, very accurate):
| Model | Speed | Quality | Best for |
|---|---|---|---|
| tiny | Very fast | Basic | Quick checks, clear speech |
| base | Fast | Decent | Simple recordings |
| small | Fairly fast | Good | Most recordings |
| medium | Moderate | Very good | Recommended default |
| large | Slower | Excellent | Difficult audio, accents |
For Dutch audio, the medium model hits the sweet spot between speed and accuracy. For English, small often suffices.
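In code, the transcription step is a single call. Here is a minimal sketch using the `mlx_whisper` package; the Hugging Face repo names follow the `mlx-community` naming convention, but verify the exact checkpoint names on the Hub before relying on them:

```python
def repo_for(model: str) -> str:
    # Map a Whisper model size to an MLX checkpoint on the Hugging Face Hub.
    # The naming scheme is an assumption; check mlx-community for exact repos.
    return f"mlx-community/whisper-{model}-mlx"

def transcribe(audio_path: str, model: str = "medium", language: str = "nl") -> str:
    # pip install mlx-whisper; imported lazily so this module loads anywhere.
    import mlx_whisper
    result = mlx_whisper.transcribe(
        audio_path,
        path_or_hf_repo=repo_for(model),
        language=language,  # omit to let Whisper auto-detect the language
    )
    return result["text"]
```

The result dictionary also contains per-segment timestamps, which are handy if you want to link summary points back to moments in the recording.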
## Setup
The setup requires three components:
```bash
# MLX Whisper for transcription
pip install mlx-whisper

# Anthropic SDK for Claude summaries (optional; swap in your preferred AI service)
pip install anthropic

# FFmpeg for audio processing
brew install ffmpeg
```
If you want AI-powered summaries, set your API key:

```bash
export ANTHROPIC_API_KEY='your-key-here'
```
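With the key set, the summarization step boils down to one Messages API call. A minimal sketch, where the prompt wording and model name are placeholders to adapt to your needs:

```python
SUMMARY_PROMPT = (
    "Summarize the following meeting transcript in concise bullet points, "
    "then list any action points with owners if mentioned.\n\n{transcript}"
)

def build_prompt(transcript: str) -> str:
    # Pure helper so the prompt can be inspected without an API key.
    return SUMMARY_PROMPT.format(transcript=transcript)

def summarize(transcript: str, model: str = "claude-sonnet-4-20250514") -> str:
    # Requires `pip install anthropic` and ANTHROPIC_API_KEY in the environment.
    import anthropic
    client = anthropic.Anthropic()  # picks up ANTHROPIC_API_KEY automatically
    message = client.messages.create(
        model=model,
        max_tokens=1024,
        messages=[{"role": "user", "content": build_prompt(transcript)}],
    )
    return message.content[0].text
```

Keeping the prompt in one place makes it easy to tweak the output style (language, level of detail) for all runs at once.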
## How the script works
Running the script presents a series of interactive prompts:
```bash
python transcribe_and_summarize.py

# Step 1: Choose language (Dutch, English, German, French, Japanese)
# Step 2: Choose Whisper model (tiny through large)
# Step 3: Choose what to generate:
#   - Summary only
#   - Action points only
#   - Summary + action points
#   - Full report
#   - Everything (report + summary + actions + visual diagrams)
#   - Nothing (transcription only)
# Step 4: Point to the audio file
# Step 5: Set the output directory
# Step 6: Wait 2-4 minutes
# Step 7: Done
```
The output is a set of markdown files, one per output type. They work well in any markdown editor, and I drop them straight into my Obsidian vault for reference.
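The choice handling in Step 3 can be sketched as a simple mapping from menu selection to a set of outputs (the names here are illustrative, not the script's actual identifiers):

```python
OUTPUT_CHOICES = {
    "1": {"summary"},
    "2": {"actions"},
    "3": {"summary", "actions"},
    "4": {"report"},
    "5": {"report", "summary", "actions", "diagrams"},
    "6": set(),  # transcription only
}

def parse_choice(choice: str) -> set:
    # Map a menu selection to the outputs to generate; reject unknown input.
    try:
        return OUTPUT_CHOICES[choice.strip()]
    except KeyError:
        raise ValueError(f"unknown choice: {choice!r}")
```

One markdown file is then written per entry in the returned set, which is why each output lands in its own file.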
## Cost comparison
| Component | Cost | Processing time |
|---|---|---|
| Transcription (MLX Whisper) | Free | 1-3 min per hour of audio |
| Summaries (Claude API) | ~$0.03-0.10 per output | 10-30 sec per output |
| Total per hour of audio | ~$0.05-0.15 | 2-4 minutes |
Compare that to commercial services that charge $0.50-2.00 per hour for transcription alone. And your audio stays on your machine.
## The privacy option
If you don’t want any data leaving your machine, you can skip the external AI tool step entirely and run the summarization locally with Ollama:
```bash
# Transcribe locally (MLX Whisper)
python transcriber_mlx.py

# Summarize locally (Ollama)
python summarize_transcript_local.py
```
The quality of local summaries depends on your hardware and the model you run, but for meeting notes it’s often good enough. The entire workflow is then 100% offline, 100% free, and 100% private.
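Under the hood, the local summarization step is just a POST to Ollama's REST API on localhost. A stdlib-only sketch, where the model name is whatever you have pulled (e.g. via `ollama pull llama3.1`) and the prompt wording is illustrative:

```python
import json
import urllib.request

def build_payload(transcript: str, model: str = "llama3.1") -> dict:
    # Pure helper: the request body for Ollama's /api/generate endpoint.
    return {
        "model": model,
        "prompt": "Summarize this meeting transcript as bullet points:\n\n" + transcript,
        "stream": False,  # one JSON response instead of a token stream
    }

def summarize_local(transcript: str, model: str = "llama3.1") -> str:
    # Talks only to localhost:11434 -- nothing leaves the machine.
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps(build_payload(transcript, model)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["response"]
```

Because the request never leaves localhost, this path has the same privacy profile as the transcription step.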
## Practical tips
**For best quality:** Use the large model. It takes longer but handles accents, background noise, and overlapping speakers much better.

**For speed:** Use the small model with summary-only output. Total processing time drops to about 2 minutes for a one-hour recording.

**For batch processing:** Transcribe all recordings first, then summarize separately. This lets you review transcripts before spending API credits on summaries.

**For recurring meetings:** Create a dedicated output folder per project or meeting series. Over time you build a searchable archive of everything discussed.
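A batch run mostly needs a stable file listing; a small sketch (the extensions match the formats mentioned earlier, and the function name is my own):

```python
from pathlib import Path

AUDIO_EXTENSIONS = {".m4a", ".mp3", ".wav"}

def find_recordings(folder: str) -> list:
    # Sorted so repeated runs process the same files in the same order.
    return sorted(
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() in AUDIO_EXTENSIONS
    )
```

Loop over the result to transcribe everything in one pass, then summarize in a second pass once you have reviewed the transcripts.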
## Why this matters
The real value isn’t the transcription itself. It’s what happens when every meeting, interview, and brainstorming session becomes searchable text. You can grep through months of meetings. You can ask your CLI AI tool to find all action points from a project’s history. You can settle the “did we discuss this?” question in seconds.
And because it runs locally, there’s no friction. Record the meeting, run the script, move on. The 15-minute setup pays for itself after the first use.