
# Data

Datasets, schemas, storage architecture, and data dictionaries for the Belief Engines ecosystem.

## Architecture

### Data Lifecycle

#### 1. Ingestion (`be-flow-dtd`)

Runs 24/7 on bare metal with GPU acceleration. Processes podcast audio into structured transcripts.

| Step | Tool | Output |
|---|---|---|
| Download | RSS parser | Raw audio files |
| Transcribe | Whisper large-v3 | Speech-to-text |
| Diarize | Pyannote 3.1 | Speaker turn boundaries |
| Speaker ID | ECAPA-TDNN | Voice-matched speaker labels |

Output lands in two Supabase tables: `transcript_chunks` and `episode_metadata`.
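As a rough illustration, a single `transcript_chunks` row might look like the sketch below. The field names here are assumptions for illustration, not the actual Supabase schema; the doc only guarantees that chunks carry speaker-diarized, voice-matched transcript text.

```python
import json

# Hypothetical shape of one transcript_chunks row produced by be-flow-dtd.
# Field names are illustrative assumptions, not the actual Supabase schema.
chunk = {
    "episode_id": "ep-0345",
    "chunk_index": 12,
    "speaker": "SPEAKER_01",       # diarized turn label (Pyannote)
    "speaker_name": "Jane Doe",    # voice-matched label (ECAPA-TDNN)
    "start_s": 84.2,               # turn boundaries in seconds
    "end_s": 97.8,
    "text": "So the way I think about it is...",
}

def validate_chunk(row: dict) -> bool:
    """Minimal sanity check: required keys present and the turn is well-ordered."""
    required = {"episode_id", "chunk_index", "speaker", "start_s", "end_s", "text"}
    return required <= row.keys() and row["start_s"] < row["end_s"]

print(json.dumps(chunk, indent=2))
print(validate_chunk(chunk))  # True
```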

#### 2. ETL (`be-podcast-etl`)

Pulls transcripts from Supabase and runs two sequential pipelines:

Episode pipeline (10 stages):

`speakers` → `ads` → `extract` → `abstract` → `embed` → `weights` → `headlines` → `matrix` → `clips` → `trust_score`

Person pipeline (6 stages):

`wiki_enrich` → `sprite` → `person_matrix` → `person_embed` → `build_index` → `build_viz`

Pipeline progress is tracked via JSON manifests in `data/runs/manifests/`. See Pipeline Overview for stage details.
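The manifests are described here only as per-episode records of stage completion status and timestamps, so the following is a hedged sketch of how such a manifest might be read to decide which stage to run next. The field names (`stages`, `status`, `completed_at`) are assumptions; the episode stage order comes from the list above.

```python
from typing import Optional

# The ten episode-pipeline stages, in order.
EPISODE_STAGES = [
    "speakers", "ads", "extract", "abstract", "embed",
    "weights", "headlines", "matrix", "clips", "trust_score",
]

# Hypothetical per-episode manifest as it might appear under
# data/runs/manifests/ -- the real field names may differ.
manifest = {
    "episode_id": "ep-0345",
    "stages": {
        "speakers": {"status": "done", "completed_at": "2024-05-01T12:00:00Z"},
        "ads":      {"status": "done", "completed_at": "2024-05-01T12:05:00Z"},
        "extract":  {"status": "running"},
    },
}

def next_stage(manifest: dict) -> Optional[str]:
    """Return the first stage not marked done, or None if the run is complete."""
    for stage in EPISODE_STAGES:
        if manifest["stages"].get(stage, {}).get("status") != "done":
            return stage
    return None

print(next_stage(manifest))  # extract
```

Resuming a run then reduces to calling `next_stage` and executing stages from that point forward.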

#### 3. Storage

All file I/O goes through a storage abstraction layer (the `Storage` ABC). There are three backends:

| Backend | Use Case | Reads | Writes |
|---|---|---|---|
| `DualStorage` (recommended) | Production | Local (fast) | Local + Supabase |
| `LocalStorage` | Development | Local | Local |
| `SupabaseStorage` | Docker / no disk | Supabase | Supabase |

`DualStorage` gives local speed with automatic cloud backup. Supabase write failures are logged but never block the pipeline. See Storage Architecture for full details.
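The dual-write behavior can be sketched as follows. This is a minimal illustration of the semantics described above, not the actual `be-podcast-etl` implementation: the class and method names are assumptions, and an in-memory dict stands in for the filesystem while a deliberately failing stub stands in for Supabase.

```python
import logging
from abc import ABC, abstractmethod

logging.basicConfig(level=logging.WARNING)
log = logging.getLogger("storage")

class Storage(ABC):
    """Sketch of the Storage ABC: one read path, one write path."""
    @abstractmethod
    def read(self, path: str) -> bytes: ...
    @abstractmethod
    def write(self, path: str, data: bytes) -> None: ...

class LocalStorage(Storage):
    def __init__(self):
        self._files: dict[str, bytes] = {}  # stand-in for the local filesystem
    def read(self, path): return self._files[path]
    def write(self, path, data): self._files[path] = data

class FlakySupabaseStorage(Storage):
    """Stub remote backend that always fails, to show the non-blocking behavior."""
    def read(self, path): raise NotImplementedError
    def write(self, path, data): raise ConnectionError("supabase unreachable")

class DualStorage(Storage):
    """Reads hit local; writes go to local first, then best-effort to Supabase."""
    def __init__(self, local: Storage, remote: Storage):
        self.local, self.remote = local, remote
    def read(self, path): return self.local.read(path)
    def write(self, path, data):
        self.local.write(path, data)       # local write must succeed
        try:
            self.remote.write(path, data)  # cloud backup is best-effort
        except Exception as exc:
            log.warning("supabase write failed for %s: %s", path, exc)

storage = DualStorage(LocalStorage(), FlakySupabaseStorage())
storage.write("manifests/ep-0345.json", b"{}")
print(storage.read("manifests/ep-0345.json"))  # b'{}' despite the failed remote write
```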

#### 4. Export

Transcript data is exported to HuggingFace as three Parquet files via `scripts/build_hf_dataset.py`. See HuggingFace Dataset for schemas and commands.

#### 5. Serving

| System | Role | Data Served |
|---|---|---|
| Supabase (Postgres + pgvector) | Structured queries, full-text search | Transcripts, episode metadata, beliefs |
| Qdrant | Vector similarity search | Belief embeddings, person embeddings |
| Local / Dual storage | Pipeline artifacts | Manifests, raw JSON, sprites, indexes |
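Conceptually, the Qdrant lookups rank stored embeddings by similarity to a query vector. The toy below shows that idea in pure Python with cosine similarity; the real collections hold 1,536-dim `text-embedding-3-large` vectors, while these 3-dim vectors and belief IDs are illustrative only.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

# Toy stand-in for a Qdrant collection: id -> embedding.
collection = {
    "belief-001": [0.9, 0.1, 0.0],
    "belief-002": [0.0, 1.0, 0.2],
    "belief-003": [0.8, 0.2, 0.1],
}

def search(query, k=2):
    """Return the k stored IDs most similar to the query vector."""
    ranked = sorted(collection.items(), key=lambda kv: cosine(query, kv[1]), reverse=True)
    return [belief_id for belief_id, _ in ranked[:k]]

print(search([1.0, 0.0, 0.0]))  # ['belief-001', 'belief-003']
```

Qdrant does the same ranking, but with approximate nearest-neighbor indexes so it scales well beyond a linear scan.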

## Key Data Artifacts

| Artifact | Source | Format | Description |
|---|---|---|---|
| Raw transcripts | `be-flow-dtd` | JSON (Supabase) | Speaker-diarized transcript chunks |
| Structured beliefs | ETL `extract` → `abstract` | JSON (sharded) | 8-layer abstraction (L0-L7) with 10-dim positioning vectors |
| Person profiles | ETL person pipeline | JSON per person | Trust badges, domain scores, bios, top quotes |
| 8-bit sprites | ETL `sprite` stage | PNG per person | NES-style pixel-art avatars |
| Semantic embeddings | ETL `embed` + `person_embed` | Qdrant collections | 1,536-dim via `text-embedding-3-large` |
| Search indexes | ETL `build_index` + `build_viz` | JSON | Person search index, similarity data, 3D viz coordinates |
| HuggingFace dataset | Export pipeline | 3 Parquet files | `transcripts`, `transcript_chunks`, `episode_metadata` |
| Pipeline manifests | ETL runtime | JSON per episode | Stage completion status and timestamps |

## Repositories

| Repo | Role | Produces | Runs On |
|---|---|---|---|
| `be-flow-dtd` | Transcription pipeline | Raw transcripts → Supabase | Bare-metal GPU (24/7) |
| `be-podcast-etl` | ETL pipeline | Beliefs, persons, embeddings, HF exports | Local / CI |
| `be-web` | API layer | Nothing (read-only) | Vercel |

## Current Scale

| Metric | Value |
|---|---|
| Transcript chunks | 11,546 |
| Episodes indexed | 345 |
| Extracted beliefs | 596 |
| Speakers profiled | 24 |
| Embedding dimensions | 1,536 |
| Belief abstraction layers | 8 (L0-L7) |
| Positioning vector dimensions | 10 |
| Trust badge tiers | 4 (bronze → platinum) |

## Data Sections

| Page | What It Covers |
|---|---|
| Belief Schema | 8-layer belief model (L0 raw quote → L7 positioning vector) |
| Person Schema | Person profiles, trust badges, domain scores |
| Storage Architecture | `Storage` ABC, local/dual/supabase backends, manifests |
| HuggingFace Dataset | Public dataset, Parquet schemas, export pipeline |