Data Pipelines

Two offline pipelines feed data into Bitcoinology. Both run independently of the frontend.

Pipeline Architecture

be-flow-dtd (Transcription Pipeline)

Runs on bare metal with GPU acceleration, 24/7.

| Stage | Tool | Purpose |
| --- | --- | --- |
| Download | RSS parser | Fetch podcast audio files |
| Transcribe | Whisper large-v3 | Speech-to-text |
| Diarize | Pyannote 3.1 | Who spoke when |
| Speaker ID | ECAPA-TDNN | Match voices to known speakers |
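Before speaker ID can run, the Whisper segments and Pyannote turns have to be merged. A minimal sketch of one common approach, assigning each transcript segment the speaker whose diarization turn overlaps it most (the function and field names here are illustrative, not the pipeline's actual API):

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length of the intersection of two time intervals, in seconds."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def assign_speakers(transcript_segments, diarization_turns):
    """Label each transcript segment with the best-overlapping diarization turn.

    transcript_segments: [{"start": float, "end": float, "text": str}]
    diarization_turns:   [{"start": float, "end": float, "speaker": str}]
    """
    labeled = []
    for seg in transcript_segments:
        best = max(
            diarization_turns,
            key=lambda t: overlap(seg["start"], seg["end"], t["start"], t["end"]),
            default=None,
        )
        labeled.append({**seg, "speaker": best["speaker"] if best else "UNKNOWN"})
    return labeled
```

Maximum-overlap assignment is a simple heuristic; segments that straddle a speaker change inherit whichever speaker held the floor longer.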

Output: Structured JSON → transcript_chunks and episode_metadata tables in Supabase.
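The exact row shape written to Supabase isn't documented here; a hypothetical example of what one transcript_chunks record might look like (every field name below is an assumption, not the real schema):

```python
import json

# Hypothetical transcript_chunks row; field names are illustrative only.
chunk = {
    "episode_id": "ep-001",   # assumed foreign key into episode_metadata
    "speaker": "SPEAKER_00",  # raw diarization label, resolved downstream
    "start": 12.4,            # seconds into the episode
    "end": 18.9,
    "text": "Bitcoin is a monetary network.",
}
print(json.dumps(chunk))
```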

be-podcast-etl (Belief Extraction)

A 10-stage pipeline that transforms raw transcripts into structured beliefs:

| Stage | What It Does |
| --- | --- |
| 1. Speaker Resolution | Map diarization labels to known speakers |
| 2. Ad Removal | Strip sponsor reads and ads |
| 3. Belief Extraction | Extract atomic beliefs (≤25 words) from quotes |
| 4. Worldview Abstraction | Derive worldview and core axiom |
| 5. Embedding Generation | OpenAI text-embedding-3-large (1536-dim) |
| 6. Ideology Weighting | 10-dimensional positioning vector |
| 7. Headlines | Generate tabloid-style headlines |
| 8. Matrix Scoring | Confidence and tier scoring |
| 9. Clip Extraction | Identify audio clip timestamps |
| 10. Trust Scores | Speaker trust badge calculation |
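The stages run sequentially, each consuming the previous stage's output. A toy sketch of that chaining, with stand-ins for stages 2 and 3 (the stage bodies are placeholders written for this example, not the real implementations):

```python
def remove_ads(chunks):
    """Stage 2 stand-in: drop chunks flagged as sponsor reads."""
    return [c for c in chunks if not c.get("is_ad", False)]

def extract_beliefs(chunks):
    """Stage 3 stand-in: keep quotes of at most 25 words as atomic beliefs."""
    return [
        {"quote": c["text"], "belief": c["text"]}
        for c in chunks
        if len(c["text"].split()) <= 25
    ]

def run_pipeline(chunks, stages):
    """Thread the chunk list through an ordered list of stage functions."""
    for stage in stages:
        chunks = stage(chunks)
    return chunks
```

Each real stage would transform a richer payload (embeddings, scores, timestamps), but the control flow is the same linear fold over stage functions.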

HuggingFace Dataset

Monthly export to ryan-beliefengines/podcast-transcripts:

  • beliefs.parquet — 596 beliefs with embeddings
  • persons.parquet — 24 speaker profiles
  • metadata.json — Dataset stats
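How metadata.json is produced isn't shown; a minimal sketch of computing dataset stats like the counts above from the exported records (the stat and field names are assumptions for illustration):

```python
import json

def dataset_stats(beliefs, persons):
    """Summarize an export for metadata.json; field names are illustrative."""
    return {
        "num_beliefs": len(beliefs),
        "num_persons": len(persons),
        "speakers_with_beliefs": len({b["person_id"] for b in beliefs}),
    }

# Toy records standing in for beliefs.parquet / persons.parquet rows.
beliefs = [{"person_id": "p1"}, {"person_id": "p1"}, {"person_id": "p2"}]
persons = [{"id": "p1"}, {"id": "p2"}, {"id": "p3"}]
print(json.dumps(dataset_stats(beliefs, persons)))
```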

License: CC BY-NC 4.0