# HuggingFace Dataset

A public dataset published at ryan-beliefengines/podcast-transcripts under the CC BY-NC 4.0 license.
## Files
| File | Records | Description |
|---|---|---|
| transcripts.parquet | 345 | Full episode transcripts with speaker diarization |
| transcript_chunks.parquet | 11,546 | Transcript chunks (~512 tokens) with optional embeddings |
| episode_metadata.parquet | 345 | Episode metadata (title, speakers, audio URL, published date) |
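Individual files can also be fetched directly with huggingface_hub. A minimal sketch, assuming the parquet files live under a data/ prefix in the repo (as the Usage section below suggests):

```python
from huggingface_hub import hf_hub_download

# Download one parquet file from the dataset repo to the local HF cache.
# The "data/" prefix is an assumption taken from the Usage examples below.
path = hf_hub_download(
    repo_id="ryan-beliefengines/podcast-transcripts",
    filename="data/transcripts.parquet",
    repo_type="dataset",
)
print(path)  # local path of the cached file
```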
## Schemas
### transcripts.parquet
| Column | Type | Description |
|---|---|---|
| episode_id | string | Unique episode identifier |
| podcast_slug | string | Podcast name slug |
| episode_slug | string | Episode name slug |
| episode_title | string | Episode title |
| published_at | string | Publication date |
| duration_seconds | int | Episode duration in seconds |
| transcript_text | string | Full transcript text |
| segments | list | Speaker-diarized segments |
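To see what the diarized segments actually contain, here is a quick sketch using pandas; the inner structure of each segment (speaker label, timestamps, text) is not pinned down by the schema above, so the example just prints the raw entries:

```python
import pandas as pd

# Load the full-transcript table and inspect one row.
df = pd.read_parquet("data/transcripts.parquet")
row = df.iloc[0]
print(row["episode_title"], row["duration_seconds"], "seconds")

# Print the first few diarized segments as-is to discover their fields;
# the schema only guarantees that `segments` is a list.
for seg in row["segments"][:3]:
    print(seg)
```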
### transcript_chunks.parquet
| Column | Type | Description |
|---|---|---|
| chunk_id | string | Unique chunk identifier |
| episode_id | string | Parent episode ID |
| text | string | Chunk text (~512 tokens) |
| timestamp_start | float | Start time in seconds |
| timestamp_end | float | End time in seconds |
| speakers | list[string] | Speakers in this chunk |
| primary_speaker | string | Main speaker in chunk |
| embedding | list[float] | 1,536-dim embedding (optional, text-embedding-3-large) |
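When the export includes embeddings, the chunks support simple nearest-neighbor search. A minimal sketch using cosine similarity between chunk embeddings (rows exported without embeddings are assumed to be null and are dropped first):

```python
import numpy as np
import pandas as pd

chunks = pd.read_parquet("data/transcript_chunks.parquet")
chunks = chunks[chunks["embedding"].notna()].reset_index(drop=True)

# Stack embeddings into a matrix and unit-normalize so dot products
# are cosine similarities.
emb = np.vstack(chunks["embedding"].to_numpy())
emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)

# Use an arbitrary chunk as the query and find its 5 nearest neighbors.
query_idx = 0
scores = emb @ emb[query_idx]
top = np.argsort(-scores)[1:6]
print(chunks.loc[top, ["episode_id", "primary_speaker", "text"]])
```

Free-text queries work the same way once the query is embedded with the same model used at export time.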
### episode_metadata.parquet
| Column | Type | Description |
|---|---|---|
| episode_id | string | Unique episode identifier |
| title | string | Episode title |
| podcast_slug | string | Podcast slug |
| duration_seconds | int | Episode duration in seconds |
| speaker_slugs | list[string] | All speakers in episode |
| audio_url | string | Original audio URL |
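Chunks carry only the parent episode_id, so attaching titles and audio URLs is a join against this table. A small sketch:

```python
import pandas as pd

# Join chunks to episode metadata on episode_id to attach titles and audio URLs.
chunks = pd.read_parquet(
    "data/transcript_chunks.parquet",
    columns=["chunk_id", "episode_id", "text"],
)
meta = pd.read_parquet(
    "data/episode_metadata.parquet",
    columns=["episode_id", "title", "audio_url"],
)
joined = chunks.merge(meta, on="episode_id", how="left")
print(joined.head())
```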
## Usage
```python
from datasets import load_dataset

# Load full transcripts
ds = load_dataset("ryan-beliefengines/podcast-transcripts", data_files="data/transcripts.parquet")

# Load chunks with embeddings
chunks = load_dataset("ryan-beliefengines/podcast-transcripts", data_files="data/transcript_chunks.parquet")

# Load metadata
meta = load_dataset("ryan-beliefengines/podcast-transcripts", data_files="data/episode_metadata.parquet")
```
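Each single-file load yields one "train" split. A short sketch of inspecting and filtering it; the episode picked below is just whichever one appears first, not a known ID:

```python
# `load_dataset` with one data_files entry produces a single "train" split.
train = chunks["train"]
print(train.column_names)

# Keep only the chunks from one episode (the first one in the file).
first_id = train[0]["episode_id"]
one_episode = train.filter(lambda row: row["episode_id"] == first_id)
print(len(one_episode), "chunks in episode", first_id)
```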
## Export Pipeline
The dataset is built by scripts/build_hf_dataset.py, which orchestrates six steps:

- Speakers backfill — Ensure all episodes have speaker labels
- Export transcripts — Full transcripts → transcripts.parquet
- Export chunks — Chunked transcripts with embeddings → transcript_chunks.parquet
- Export metadata — Episode metadata → episode_metadata.parquet
- Validate — Cross-check that episode IDs match across all files (a sketch of this check follows the list)
- Upload — Push to HuggingFace via the huggingface_hub API
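The validation step can be pictured as a set comparison of episode IDs across the three files. This is a hypothetical sketch, not the actual code in scripts/build_hf_dataset.py:

```python
import pandas as pd

def validate_episode_ids(data_dir: str = "data") -> None:
    """Assert that the same episode IDs appear in all three exported files."""
    transcripts = set(pd.read_parquet(f"{data_dir}/transcripts.parquet")["episode_id"])
    chunks = set(pd.read_parquet(f"{data_dir}/transcript_chunks.parquet")["episode_id"])
    metadata = set(pd.read_parquet(f"{data_dir}/episode_metadata.parquet")["episode_id"])
    assert transcripts == metadata, "transcripts and metadata episode IDs differ"
    assert transcripts == chunks, "chunk episode IDs differ from transcripts"

validate_episode_ids()
```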
## Commands
```bash
# Full build and upload
uv run python scripts/build_hf_dataset.py

# Skip embedding costs
uv run python scripts/build_hf_dataset.py --skip-embeddings

# Dry run (cost estimate only)
uv run python scripts/build_hf_dataset.py --dry-run

# Export only (skip speaker backfill)
uv run python scripts/build_hf_dataset.py --skip-backfill

# Upload only (after manual export)
uv run python scripts/upload_to_hf.py --repo ryan-beliefengines/podcast-transcripts
```
## Environment Variables
| Variable | Description |
|---|---|
| HF_BITCOINOLOGY_WRITE | HuggingFace write token (preferred) |
| HF_TOKEN | HuggingFace token (fallback) |
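The token fallback order described above, sketched with huggingface_hub; the actual upload logic lives in scripts/upload_to_hf.py and may differ in detail:

```python
import os
from huggingface_hub import HfApi

# Prefer the dedicated write token, fall back to the generic HF_TOKEN.
token = os.environ.get("HF_BITCOINOLOGY_WRITE") or os.environ.get("HF_TOKEN")

api = HfApi(token=token)
api.upload_folder(
    folder_path="data",  # assumed local export directory
    repo_id="ryan-beliefengines/podcast-transcripts",
    repo_type="dataset",
)
```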
## Update Frequency
The dataset is currently exported on demand. The target is an automated monthly export once the pipeline backfill completes.