
# Data

Datasets, schemas, storage architecture, and data dictionaries for the Belief Engines ecosystem.

## Architecture

### Data Lifecycle

#### 1. Ingestion (`be-flow-dtd`)

Runs 24/7 on bare metal with GPU acceleration. Processes podcast audio into structured transcripts.

| Step | Tool | Output |
|---|---|---|
| Download | RSS parser | Raw audio files |
| Transcribe | Whisper large-v3 | Speech-to-text |
| Diarize | Pyannote 3.1 | Speaker turn boundaries |
| Speaker ID | ECAPA-TDNN | Voice-matched speaker labels |

Output lands in two Supabase tables: `transcript_chunks` and `episode_metadata`.
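As a rough illustration, a single `transcript_chunks` row might look like the sketch below. The field names here are assumptions for illustration, not the actual Supabase schema; the doc only guarantees that chunks carry speaker-diarized, voice-matched transcript text.

```python
import json

# Hypothetical shape of one transcript_chunks row produced by be-flow-dtd.
# Field names are illustrative assumptions, not the actual Supabase schema.
chunk = {
    "episode_id": "ep-0345",
    "chunk_index": 12,
    "speaker": "SPEAKER_01",       # diarized turn label (Pyannote)
    "speaker_name": "Jane Doe",    # voice-matched label (ECAPA-TDNN)
    "start_s": 84.2,               # turn boundaries in seconds
    "end_s": 97.8,
    "text": "So the way I think about it is...",
}

def validate_chunk(row: dict) -> bool:
    """Minimal sanity check: required keys present and the turn is well-ordered."""
    required = {"episode_id", "chunk_index", "speaker", "start_s", "end_s", "text"}
    return required <= row.keys() and row["start_s"] < row["end_s"]

print(json.dumps(chunk, indent=2))
print(validate_chunk(chunk))  # True
```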

#### 2. ETL (`be-podcast-etl`)

Pulls transcripts from Supabase and runs two sequential pipelines:

Episode pipeline (10 stages):

`speakers` → `ads` → `extract` → `abstract` → `embed` → `weights` → `headlines` → `matrix` → `clips` → `trust_score`

Person pipeline (6 stages):

`wiki_enrich` → `sprite` → `person_matrix` → `person_embed` → `build_index` → `build_viz`

Pipeline progress is tracked via JSON manifests in `data/runs/manifests/`. See Pipeline Overview for stage details.
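The manifests are described here only as per-episode records of stage completion status and timestamps, so the following is a hedged sketch of how such a manifest might be read to decide which stage to run next. The field names (`stages`, `status`, `completed_at`) are assumptions; the episode stage order comes from the list above.

```python
from typing import Optional

# The ten episode-pipeline stages, in order.
EPISODE_STAGES = [
    "speakers", "ads", "extract", "abstract", "embed",
    "weights", "headlines", "matrix", "clips", "trust_score",
]

# Hypothetical per-episode manifest as it might appear under
# data/runs/manifests/ -- the real field names may differ.
manifest = {
    "episode_id": "ep-0345",
    "stages": {
        "speakers": {"status": "done", "completed_at": "2024-05-01T12:00:00Z"},
        "ads":      {"status": "done", "completed_at": "2024-05-01T12:05:00Z"},
        "extract":  {"status": "running"},
    },
}

def next_stage(manifest: dict) -> Optional[str]:
    """Return the first stage not marked done, or None if the run is complete."""
    for stage in EPISODE_STAGES:
        if manifest["stages"].get(stage, {}).get("status") != "done":
            return stage
    return None

print(next_stage(manifest))  # extract
```

Resuming a run then reduces to calling `next_stage` and executing stages from that point forward.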

#### 3. Storage

All file I/O goes through a storage abstraction layer (the `Storage` ABC). There are three backends:

| Backend | Use Case | Reads | Writes |
|---|---|---|---|
| `DualStorage` (recommended) | Production | Local (fast) | Local + Supabase |
| `LocalStorage` | Development | Local | Local |
| `SupabaseStorage` | Docker / no disk | Supabase | Supabase |

`DualStorage` gives local speed with automatic cloud backup. Supabase write failures are logged but never block the pipeline. See Storage Architecture for full details.
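The dual-write behavior can be sketched as follows. This is a minimal illustration of the semantics described above, not the actual `be-podcast-etl` implementation: the class and method names are assumptions, and an in-memory dict stands in for the filesystem while a deliberately failing stub stands in for Supabase.

```python
import logging
from abc import ABC, abstractmethod

logging.basicConfig(level=logging.WARNING)
log = logging.getLogger("storage")

class Storage(ABC):
    """Sketch of the Storage ABC: one read path, one write path."""
    @abstractmethod
    def read(self, path: str) -> bytes: ...
    @abstractmethod
    def write(self, path: str, data: bytes) -> None: ...

class LocalStorage(Storage):
    def __init__(self):
        self._files: dict[str, bytes] = {}  # stand-in for the local filesystem
    def read(self, path): return self._files[path]
    def write(self, path, data): self._files[path] = data

class FlakySupabaseStorage(Storage):
    """Stub remote backend that always fails, to show the non-blocking behavior."""
    def read(self, path): raise NotImplementedError
    def write(self, path, data): raise ConnectionError("supabase unreachable")

class DualStorage(Storage):
    """Reads hit local; writes go to local first, then best-effort to Supabase."""
    def __init__(self, local: Storage, remote: Storage):
        self.local, self.remote = local, remote
    def read(self, path): return self.local.read(path)
    def write(self, path, data):
        self.local.write(path, data)       # local write must succeed
        try:
            self.remote.write(path, data)  # cloud backup is best-effort
        except Exception as exc:
            log.warning("supabase write failed for %s: %s", path, exc)

storage = DualStorage(LocalStorage(), FlakySupabaseStorage())
storage.write("manifests/ep-0345.json", b"{}")
print(storage.read("manifests/ep-0345.json"))  # b'{}' despite the failed remote write
```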

#### 4. Export

Transcript data is exported to HuggingFace as three Parquet files via `scripts/build_hf_dataset.py`. See HuggingFace Dataset for schemas and commands.

#### 5. Serving

| System | Role | Data Served |
|---|---|---|
| Supabase (Postgres + pgvector) | Structured queries, full-text search | Transcripts, episode metadata, beliefs |
| Qdrant | Vector similarity search | Belief embeddings, person embeddings |
| Local / Dual storage | Pipeline artifacts | Manifests, raw JSON, sprites, indexes |
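Conceptually, the Qdrant lookups rank stored embeddings by similarity to a query vector. The toy below shows that idea in pure Python with cosine similarity; the real collections hold 1,536-dim `text-embedding-3-large` vectors, while these 3-dim vectors and belief IDs are illustrative only.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

# Toy stand-in for a Qdrant collection: id -> embedding.
collection = {
    "belief-001": [0.9, 0.1, 0.0],
    "belief-002": [0.0, 1.0, 0.2],
    "belief-003": [0.8, 0.2, 0.1],
}

def search(query, k=2):
    """Return the k stored IDs most similar to the query vector."""
    ranked = sorted(collection.items(), key=lambda kv: cosine(query, kv[1]), reverse=True)
    return [belief_id for belief_id, _ in ranked[:k]]

print(search([1.0, 0.0, 0.0]))  # ['belief-001', 'belief-003']
```

Qdrant does the same ranking, but with approximate nearest-neighbor indexes so it scales well beyond a linear scan.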

## Key Data Artifacts

| Artifact | Source | Format | Description |
|---|---|---|---|
| Raw transcripts | `be-flow-dtd` | JSON (Supabase) | Speaker-diarized transcript chunks |
| Structured beliefs | ETL `extract` → `abstract` | JSON (sharded) | 8-layer abstraction (L0-L7) with 10-dim positioning vectors |
| Person profiles | ETL person pipeline | JSON per person | Trust badges, domain scores, bios, top quotes |
| 8-bit sprites | ETL `sprite` stage | PNG per person | NES-style pixel-art avatars |
| Semantic embeddings | ETL `embed` + `person_embed` | Qdrant collections | 1,536-dim via `text-embedding-3-large` |
| Search indexes | ETL `build_index` + `build_viz` | JSON | Person search index, similarity data, 3D viz coordinates |
| HuggingFace dataset | Export pipeline | 3 Parquet files | `transcripts`, `transcript_chunks`, `episode_metadata` |
| Pipeline manifests | ETL runtime | JSON per episode | Stage completion status and timestamps |

## Repositories

| Repo | Role | Produces | Runs On |
|---|---|---|---|
| `be-flow-dtd` | Transcription pipeline | Raw transcripts → Supabase | Bare-metal GPU (24/7) |
| `be-podcast-etl` | ETL pipeline | Beliefs, persons, embeddings, HF exports | Local / CI |
| `be-web` | API layer | Nothing (read-only) | Vercel |

## Current Scale

| Metric | Value |
|---|---|
| Transcript chunks | 11,546 |
| Episodes indexed | 345 |
| Extracted beliefs | 596 |
| Speakers profiled | 24 |
| Embedding dimensions | 1,536 |
| Belief abstraction layers | 8 (L0-L7) |
| Positioning vector dimensions | 10 |
| Trust badge tiers | 4 (bronze → platinum) |

## Data Sections

| Page | What It Covers |
|---|---|
| Belief Schema | 8-layer belief model (L0 raw quote → L7 positioning vector) |
| Person Schema | Person profiles, trust badges, domain scores |
| Storage Architecture | `Storage` ABC, local/dual/supabase backends, manifests |
| HuggingFace Dataset | Public dataset, Parquet schemas, export pipeline |