Back to Projectsexperimental

Voice Note Transcriber

An experimental tool that converts voice memos into structured notes with AI-powered summarization and action item extraction.

October 20, 2023

2 min read

Aaron M Sabu

WhisperPythonFastAPIReactOpenAI

Overview

Voice Note Transcriber is an experiment in turning spoken thoughts into organized, actionable notes.

The Experiment

I often record voice memos with ideas, meeting notes, or random thoughts. The problem? They pile up and never get processed. This tool aims to:

Transcribe voice recordings accurately
Structure the content into organized notes
Extract action items and key points
Summarize for quick review

How It Works

1. Transcription

Using OpenAIs Whisper model for accurate speech-to-text:

import whisper

model = whisper.load_model("base")
result = model.transcribe("voice_memo.mp3")
transcript = result["text"]

2. Processing

The transcript is then processed by GPT-4 to:

Correct transcription errors based on context
Add punctuation and formatting
Identify speakers (if multiple)

3. Structuring

The AI organizes content into:

## Summary
Brief overview of the main points

## Key Points
- Point 1
- Point 2
- Point 3

## Action Items
- [ ] Task extracted from the recording
- [ ] Another task

## Raw Transcript
Full transcription for reference

Technical Challenges

Audio Quality

Voice memos are often recorded in noisy environments. Solutions:

Noise reduction preprocessing
Multiple transcription passes
Confidence scoring for uncertain words

Context Understanding

Spoken language is different from written:

Filler words ("um", "uh")
Incomplete sentences
Topic jumping

The AI needs to clean this up while preserving meaning.

Current Status

This is an ongoing experiment. Current capabilities:

Transcription accuracy: ~95% for clear audio
Structure quality: Good for meeting notes, improving for brainstorms
Processing time: ~30 seconds for a 5-minute recording

Future Ideas

Real-time transcription
Mobile app with one-tap recording
Integration with note-taking apps (Notion, Obsidian)
Speaker identification for meetings

What Im Learning

Speech-to-text has come incredibly far
The gap between transcription and understanding is where AI shines
Voice interfaces are underutilized for productivity