Data
Datasets, schemas, storage architecture, and data dictionaries for the Belief Engines ecosystem.
Architecture
Data Lifecycle
1. Ingestion (be-flow-dtd)
Runs 24/7 on bare metal with GPU acceleration. Processes podcast audio into structured transcripts.
| Step | Tool | Output |
|---|---|---|
| Download | RSS parser | Raw audio files |
| Transcribe | Whisper large-v3 | Text transcript (speech-to-text) |
| Diarize | Pyannote 3.1 | Speaker turn boundaries |
| Speaker ID | ECAPA-TDNN | Voice-matched speaker labels |
Output lands in two Supabase tables: transcript_chunks and episode_metadata.
2. ETL (be-podcast-etl)
Pulls transcripts from Supabase and runs two sequential pipelines:
Episode pipeline (10 stages):
speakers → ads → extract → abstract → embed → weights → headlines → matrix → clips → trust_score
Person pipeline (6 stages):
wiki_enrich → sprite → person_matrix → person_embed → build_index → build_viz
Pipeline progress is tracked via JSON manifests in data/runs/manifests/. See Pipeline Overview for stage details.
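The manifest-based tracking described above can be sketched roughly as follows. This is a minimal illustration, not the actual be-podcast-etl code: the manifest layout (`stages`, `status`, `completed_at`), the one-file-per-episode naming, and the `run_episode` and `handlers` names are all assumptions. Only the stage order comes from the list above.

```python
import json
from datetime import datetime, timezone
from pathlib import Path

# Stage order as listed for the episode pipeline.
EPISODE_STAGES = [
    "speakers", "ads", "extract", "abstract", "embed",
    "weights", "headlines", "matrix", "clips", "trust_score",
]


def load_manifest(path: Path) -> dict:
    """Return the manifest at `path`, or an empty one if none exists yet."""
    if path.exists():
        return json.loads(path.read_text())
    return {"stages": {}}  # hypothetical layout


def run_episode(episode_id: str, manifest_dir: Path, handlers: dict) -> list:
    """Run pending stages in order and return the stages executed in this call.

    `handlers` maps a stage name to a callable taking the episode id
    (a hypothetical interface used only for this sketch).
    """
    manifest_dir.mkdir(parents=True, exist_ok=True)
    path = manifest_dir / f"{episode_id}.json"
    manifest = load_manifest(path)
    executed = []
    for stage in EPISODE_STAGES:
        if manifest["stages"].get(stage, {}).get("status") == "done":
            continue  # already completed in an earlier run
        handlers[stage](episode_id)  # an exception stops the run at this stage
        manifest["stages"][stage] = {
            "status": "done",
            "completed_at": datetime.now(timezone.utc).isoformat(),
        }
        # Persist after every stage so a crash loses at most the stage in flight.
        path.write_text(json.dumps(manifest, indent=2))
        executed.append(stage)
    return executed
```

Because the manifest is rewritten after each stage, re-running a failed episode resumes at the first stage not marked done instead of starting over.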
3. Storage
All file I/O goes through a storage abstraction layer (Storage ABC). Three backends:
| Backend | Use Case | Reads | Writes |
|---|---|---|---|
| DualStorage (recommended) | Production | Local (fast) | Local + Supabase |
| LocalStorage | Development | Local | Local |
| SupabaseStorage | Docker / no disk | Supabase | Supabase |
DualStorage gives local speed with automatic cloud backup. Supabase write failures are logged but never block the pipeline. See Storage Architecture for full details.
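A minimal sketch of the dual-write behaviour described above, assuming a two-method `read`/`write` interface. The real Storage ABC may expose different methods, and the remote backend here is any `Storage` implementation rather than an actual Supabase client.

```python
import logging
from abc import ABC, abstractmethod
from pathlib import Path

logger = logging.getLogger("storage")


class Storage(ABC):
    """Assumed interface: the method names are illustrative."""

    @abstractmethod
    def read(self, key: str) -> bytes: ...

    @abstractmethod
    def write(self, key: str, data: bytes) -> None: ...


class LocalStorage(Storage):
    """Stores each key as a file under a root directory."""

    def __init__(self, root):
        self.root = Path(root)

    def read(self, key: str) -> bytes:
        return (self.root / key).read_bytes()

    def write(self, key: str, data: bytes) -> None:
        path = self.root / key
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)


class DualStorage(Storage):
    """Reads locally; writes locally, then mirrors to a remote backend."""

    def __init__(self, local: Storage, remote: Storage):
        self.local = local
        self.remote = remote

    def read(self, key: str) -> bytes:
        return self.local.read(key)  # local reads keep the pipeline fast

    def write(self, key: str, data: bytes) -> None:
        self.local.write(key, data)  # a local failure still raises
        try:
            self.remote.write(key, data)
        except Exception:
            # Remote failures are logged and swallowed so the pipeline continues.
            logger.exception("remote write failed for %s", key)
```

The trade-off in this design is that the remote copy can silently lag behind the local one after a failed write, so a periodic reconciliation pass may be worth having.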
4. Export
Transcript data is exported to HuggingFace as 3 parquet files via scripts/build_hf_dataset.py. See HuggingFace Dataset for schemas and commands.
5. Serving
| System | Role | Data Served |
|---|---|---|
| Supabase Postgres + pgvector | Structured queries, full-text search | Transcripts, episode metadata, beliefs |
| Qdrant | Vector similarity search | Belief embeddings, person embeddings |
| Local / Dual storage | Pipeline artifacts | Manifests, raw JSON, sprites, indexes |
Key Data Artifacts
| Artifact | Source | Format | Description |
|---|---|---|---|
| Raw transcripts | be-flow-dtd | JSON (Supabase) | Speaker-diarized transcript chunks |
| Structured beliefs | ETL extract → abstract | JSON (sharded) | 8-layer abstraction L0-L7 with 10-dim positioning vectors |
| Person profiles | ETL person pipeline | JSON per person | Trust badges, domain scores, bios, top quotes |
| 8-bit sprites | ETL sprite stage | PNG per person | NES-style pixel art avatars |
| Semantic embeddings | ETL embed + person_embed | Qdrant collections | 1,536-dim via text-embedding-3-large |
| Search indexes | ETL build_index + build_viz | JSON | Person search index, similarity data, 3D viz coordinates |
| HuggingFace dataset | Export pipeline | 3 parquet files | transcripts, transcript_chunks, episode_metadata |
| Pipeline manifests | ETL runtime | JSON per episode | Stage completion status and timestamps |
Repositories
| Repo | Role | Produces | Runs On |
|---|---|---|---|
| be-flow-dtd | Transcription pipeline | Raw transcripts → Supabase | Bare metal GPU (24/7) |
be-podcast-etl | ETL pipeline | Beliefs, persons, embeddings, HF exports | Local / CI |
be-web | API layer | Nothing (read-only) | Vercel |
Current Scale
| Metric | Value |
|---|---|
| Transcript chunks | 11,546 |
| Episodes indexed | 345 |
| Extracted beliefs | 596 |
| Speakers profiled | 24 |
| Embedding dimensions | 1,536 |
| Belief abstraction layers | 8 (L0-L7) |
| Positioning vector dimensions | 10 |
| Trust badge tiers | 4 (bronze → platinum) |
Data Sections
| Page | What It Covers |
|---|---|
| Belief Schema | 8-layer belief model (L0 raw quote → L7 positioning vector) |
| Person Schema | Person profiles, trust badges, domain scores |
| Storage Architecture | Storage ABC, local/dual/supabase backends, manifests |
| HuggingFace Dataset | Public dataset, parquet schemas, export pipeline |
Quick Links
- Supabase schema: Data Model (ER diagram, 12 tables, 16 RPC functions)
- Pipeline stages: Pipeline Overview (10-stage episode + 6-stage person)
- Vector search: HNSW index on halfvec(1536) with cosine similarity
- Public dataset: ryan-beliefengines/podcast-transcripts (CC BY-NC 4.0)
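
For orientation, the cosine similarity ranking that an HNSW index approximates can be written as an exact brute-force search. This is an illustrative sketch only: the function names are hypothetical, and the production path goes through the pgvector index rather than Python.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, vectors, k=3):
    """Exact nearest neighbours by cosine similarity, most similar first.

    `vectors` maps an id to its embedding. An HNSW index returns an
    approximation of this ranking without scanning every vector.
    """
    scored = [(key, cosine_similarity(query, vec)) for key, vec in vectors.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]
```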