# Data Pipelines
Two offline pipelines feed data into Bitcoinology. Both run independently of the frontend.
## Pipeline Architecture
### be-flow-dtd (Transcription Pipeline)
Runs on bare metal with GPU acceleration, 24/7.
| Stage | Tool | Purpose |
|---|---|---|
| Download | RSS parser | Fetch podcast audio files |
| Transcribe | Whisper large-v3 | Speech-to-text |
| Diarize | Pyannote 3.1 | Who spoke when |
| Speaker ID | ECAPA-TDNN | Match voices to known speakers |
Output: structured JSON written to the `transcript_chunks` and `episode_metadata` tables in Supabase.
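The chunk records written to Supabase can be sketched as plain dicts; the field names below are illustrative assumptions, not the actual table schema:

```python
# Sketch of one transcript chunk as produced by the transcription pipeline.
# Field names are assumed for illustration; the real Supabase schema may differ.
chunk = {
    "episode_id": "ep-0042",        # foreign key into episode_metadata
    "start_s": 123.4,               # chunk start time in seconds
    "end_s": 131.9,                 # chunk end time in seconds
    "speaker_label": "SPEAKER_01",  # raw diarization label from Pyannote
    "speaker_id": "speaker_a",      # resolved via ECAPA-TDNN voice match
    "text": "Bitcoin is digital energy.",
}

def validate_chunk(c: dict) -> bool:
    """Minimal sanity check before upserting a chunk into Supabase."""
    return c["start_s"] < c["end_s"] and bool(c["text"].strip())

print(validate_chunk(chunk))
```

A check like this is cheap insurance against diarization edge cases (zero-length or empty chunks) before they reach the belief-extraction pipeline.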
### be-podcast-etl (Belief Extraction)
A 10-stage pipeline that transforms raw transcripts into structured beliefs:
| Stage | What It Does |
|---|---|
| 1. Speaker Resolution | Map diarization labels to known speakers |
| 2. Ad Removal | Strip sponsor reads and ads |
| 3. Belief Extraction | Extract atomic beliefs (≤25 words) from quotes |
| 4. Worldview Abstraction | Derive worldview and core axiom |
| 5. Embedding Generation | OpenAI text-embedding-3-large, reduced to 1536 dims (the model's default is 3072) |
| 6. Ideology Weighting | 10-dimensional positioning vector |
| 7. Headlines | Generate tabloid-style headlines |
| 8. Matrix Scoring | Confidence and tier scoring |
| 9. Clip Extraction | Identify audio clip timestamps |
| 10. Trust Scores | Speaker trust badge calculation |
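The stages above can be sketched as a simple function chain over an episode record. The stage shown (belief extraction with its ≤25-word limit) mirrors the table, but the function body and record fields are placeholder assumptions:

```python
# Minimal sketch of the be-podcast-etl stage chain. Each stage takes and
# returns the episode record dict; the real pipeline likely persists
# intermediate state, but this shows the basic data flow.

def extract_beliefs(record: dict) -> dict:
    """Stage 3 sketch: keep only atomic beliefs of 25 words or fewer."""
    record["beliefs"] = [
        b for b in record.get("belief_candidates", [])
        if len(b.split()) <= 25
    ]
    return record

def run_pipeline(record: dict, stages) -> dict:
    """Apply each stage in order, threading the record through."""
    for stage in stages:
        record = stage(record)
    return record

record = {"belief_candidates": [
    "Fiat debasement is inevitable.",
    " ".join(["word"] * 30),  # too long: dropped by the 25-word limit
]}
result = run_pipeline(record, [extract_beliefs])
print(result["beliefs"])  # ['Fiat debasement is inevitable.']
```

Keeping each stage as a pure record-in, record-out function makes individual stages easy to test and reorder.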
## HuggingFace Dataset
Monthly export to `ryan-beliefengines/podcast-transcripts`:
- `beliefs.parquet` — 596 beliefs with embeddings
- `persons.parquet` — 24 speaker profiles
- `metadata.json` — Dataset stats
License: CC BY-NC 4.0
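Because `beliefs.parquet` ships embeddings alongside each belief, consumers can rank beliefs against a query vector with plain cosine similarity. A pure-Python sketch (toy 3-dim vectors stand in for the dataset's 1536-dim embeddings, which would be loaded from the parquet file):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors for illustration; real rows come from beliefs.parquet.
query = [1.0, 0.0, 0.0]
beliefs = {
    "belief_a": [0.9, 0.1, 0.0],
    "belief_b": [0.0, 1.0, 0.0],
}
ranked = sorted(beliefs, key=lambda k: cosine_similarity(query, beliefs[k]),
                reverse=True)
print(ranked[0])  # belief_a
```

For the full 596-row dataset, a vectorized NumPy or pandas version of the same computation would be the practical choice.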