
HuggingFace Dataset

A public dataset published at ryan-beliefengines/podcast-transcripts on the HuggingFace Hub under the CC BY-NC 4.0 license.

Files

| File | Records | Description |
| --- | --- | --- |
| transcripts.parquet | 345 | Full episode transcripts with speaker diarization |
| transcript_chunks.parquet | 11,546 | Transcript chunks (~512 tokens) with optional embeddings |
| episode_metadata.parquet | 345 | Episode metadata (title, speakers, audio URL, published date) |

Schemas

transcripts.parquet

| Column | Type | Description |
| --- | --- | --- |
| episode_id | string | Unique episode identifier |
| podcast_slug | string | Podcast name slug |
| episode_slug | string | Episode name slug |
| episode_title | string | Episode title |
| published_at | string | Publication date |
| duration_seconds | int | Episode duration in seconds |
| transcript_text | string | Full transcript text |
| segments | list | Speaker-diarized segments |
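
For example, a transcript row can be inspected with pandas; a minimal sketch, assuming the exported file is available locally (the internal structure of each segment may vary from what is printed here):

import pandas as pd

df = pd.read_parquet("data/transcripts.parquet")
row = df.iloc[0]

print(row["episode_title"], row["duration_seconds"])

# Peek at the first few diarized segments (structure may vary).
for seg in row["segments"][:3]:
    print(seg)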

transcript_chunks.parquet

| Column | Type | Description |
| --- | --- | --- |
| chunk_id | string | Unique chunk identifier |
| episode_id | string | Parent episode ID |
| text | string | Chunk text (~512 tokens) |
| timestamp_start | float | Start time in seconds |
| timestamp_end | float | End time in seconds |
| speakers | list[string] | Speakers in this chunk |
| primary_speaker | string | Main speaker in chunk |
| embedding | list[float] | 1,536-dim embedding (optional, text-embedding-3-large) |
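
Because chunk embeddings are 1,536-dim text-embedding-3-large vectors, a query embedded with the same model and dimensions can be matched by cosine similarity. A minimal retrieval sketch, assuming OPENAI_API_KEY is set and the embedding column is populated (the query string is just an example):

import numpy as np
import pandas as pd
from openai import OpenAI

chunks = pd.read_parquet("data/transcript_chunks.parquet")
chunks = chunks[chunks["embedding"].notna()]

client = OpenAI()  # reads OPENAI_API_KEY from the environment
query_vec = np.array(
    client.embeddings.create(
        model="text-embedding-3-large",
        input="what is proof of work?",
        dimensions=1536,  # match the stored embedding size
    ).data[0].embedding
)

matrix = np.vstack(chunks["embedding"].to_numpy())
# Cosine similarity = dot product of L2-normalized vectors.
matrix = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
query_vec = query_vec / np.linalg.norm(query_vec)
scores = matrix @ query_vec

top = chunks.iloc[np.argsort(scores)[::-1][:5]]
print(top[["episode_id", "timestamp_start", "text"]])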

episode_metadata.parquet

| Column | Type | Description |
| --- | --- | --- |
| episode_id | string | Unique episode identifier |
| title | string | Episode title |
| podcast_slug | string | Podcast slug |
| duration_seconds | int | Episode duration in seconds |
| speaker_slugs | list[string] | All speakers in episode |
| audio_url | string | Original audio URL |
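
The shared episode_id makes it straightforward to attach episode metadata to chunks; a minimal pandas sketch:

import pandas as pd

chunks = pd.read_parquet("data/transcript_chunks.parquet")
meta = pd.read_parquet("data/episode_metadata.parquet")

# Attach the episode title and audio URL to every chunk.
enriched = chunks.merge(
    meta[["episode_id", "title", "audio_url"]],
    on="episode_id",
    how="left",
)
print(enriched[["chunk_id", "title", "timestamp_start"]].head())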

Usage

from datasets import load_dataset

# Load full transcripts
ds = load_dataset("ryan-beliefengines/podcast-transcripts", data_files="data/transcripts.parquet")

# Load chunks with embeddings
chunks = load_dataset("ryan-beliefengines/podcast-transcripts", data_files="data/transcript_chunks.parquet")

# Load metadata
meta = load_dataset("ryan-beliefengines/podcast-transcripts", data_files="data/episode_metadata.parquet")
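
With explicit data_files, load_dataset returns a DatasetDict containing a single "train" split; converting to pandas is often convenient:

df = ds["train"].to_pandas()  # 345 rows, one per episode
print(df.columns.tolist())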

Export Pipeline

The dataset is built by scripts/build_hf_dataset.py, which orchestrates six steps:

  1. Speakers backfill — Ensure all episodes have speaker labels
  2. Export transcripts — Full transcripts → transcripts.parquet
  3. Export chunks — Chunked transcripts with embeddings → transcript_chunks.parquet
  4. Export metadata — Episode metadata → episode_metadata.parquet
  5. Validate — Cross-check that episode IDs match across all files (sketched below)
  6. Upload — Push to HuggingFace via huggingface_hub API
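
The validation in step 5 amounts to comparing the episode ID sets across the three exports. A minimal sketch, assuming the files sit in a local data/ directory (the actual script may differ):

import pandas as pd

transcripts = pd.read_parquet("data/transcripts.parquet")
chunks = pd.read_parquet("data/transcript_chunks.parquet")
meta = pd.read_parquet("data/episode_metadata.parquet")

transcript_ids = set(transcripts["episode_id"])
chunk_ids = set(chunks["episode_id"])
meta_ids = set(meta["episode_id"])

# Every file must cover exactly the same set of episodes.
assert transcript_ids == chunk_ids == meta_ids, "episode IDs do not match across files"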

Commands

# Full build and upload
uv run python scripts/build_hf_dataset.py

# Skip embedding costs
uv run python scripts/build_hf_dataset.py --skip-embeddings

# Dry run (cost estimate only)
uv run python scripts/build_hf_dataset.py --dry-run

# Export only (skip speaker backfill)
uv run python scripts/build_hf_dataset.py --skip-backfill

# Upload only (after manual export)
uv run python scripts/upload_to_hf.py --repo ryan-beliefengines/podcast-transcripts
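
The --dry-run cost estimate reduces to simple arithmetic over the chunk count. A rough sketch; the $0.13 per 1M tokens text-embedding-3-large price is an assumption, so check current OpenAI pricing:

# ~11,546 chunks x ~512 tokens each at an assumed $0.13 / 1M tokens
chunks = 11_546
tokens_per_chunk = 512
price_per_million = 0.13
estimated_cost = chunks * tokens_per_chunk * price_per_million / 1_000_000
print(f"~${estimated_cost:.2f} to embed all chunks")  # ~ $0.77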

Environment Variables

| Variable | Description |
| --- | --- |
| HF_BITCOINOLOGY_WRITE | HuggingFace write token (preferred) |
| HF_TOKEN | HuggingFace token (fallback) |
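
A minimal token-resolution and upload sketch using the huggingface_hub API; the local data/ folder layout is an assumption:

import os
from huggingface_hub import HfApi

# Prefer the dedicated write token, fall back to the generic one.
token = os.environ.get("HF_BITCOINOLOGY_WRITE") or os.environ.get("HF_TOKEN")

api = HfApi(token=token)
api.upload_folder(
    folder_path="data",
    path_in_repo="data",
    repo_id="ryan-beliefengines/podcast-transcripts",
    repo_type="dataset",
)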

Update Frequency

The dataset is currently exported on demand. Target: an automated monthly export once the pipeline backfill completes.