You're on the right track, and what you're describing is a modern, reproducible, cost-aware data analytics pipeline. Let's refine and structure it step by step to ensure it's robust, automatable, and production-ready.

## 🧠 Your High-Level Plan (Refined)
Here's your concept, revised into a clean, reliable pipeline:
1. **Daily Extraction: BigQuery → Parquet or DuckDB**
   - Run daily query jobs to extract data from BigQuery (see the sketch below)
   - Save to compressed Parquet (e.g., gzip, snappy) or `.duckdb`
   - Store locally (for dev) and/or push to:
     - Google Cloud Storage (for backup + reproducibility)
     - Optionally commit small files to GitHub (via DVC or Git LFS)
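A minimal sketch of this extraction step, assuming a placeholder table `my_project.my_dataset.events` with an `event_date` column (requires `google-cloud-bigquery` plus pandas/pyarrow for the Parquet write):

```python
# Minimal sketch: daily BigQuery -> Parquet extract.
# Table and column names are placeholders.
from datetime import date, timedelta

from google.cloud import bigquery

client = bigquery.Client()

# Placeholder for pipeline state: in practice, read the last extracted
# date from a manifest or the previous output instead of hardcoding it.
last_day = date.today() - timedelta(days=2)

query = """
    SELECT *
    FROM `my_project.my_dataset.events`
    WHERE event_date > @last_day  -- incremental filter
"""
job_config = bigquery.QueryJobConfig(
    query_parameters=[
        bigquery.ScalarQueryParameter("last_day", "DATE", last_day),
    ]
)

df = client.query(query, job_config=job_config).to_dataframe()
df.to_parquet(f"data/events_{date.today()}.parquet", compression="snappy")
```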
2. **Versioned Storage & Tracking: DVC**
   - Use DVC to (commands sketched below):
     - Track changes to the data files
     - Push data to GCS remote storage
     - Link data versions to specific code/report commits
   - Avoid storing large files directly in GitHub
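A sketch of the corresponding DVC workflow, as a one-time setup plus the daily tracking commands (the bucket name and file name are placeholders):

```bash
# One-time setup: initialize DVC and point it at a GCS remote
# (requires dvc[gs]); bucket name is a placeholder
dvc init
dvc remote add -d gcs_remote gs://my-analytics-bucket/dvc

# Daily: track the new extract; Git keeps only the small .dvc metadata
dvc add data/events_2025-01-01.parquet
git add data/events_2025-01-01.parquet.dvc data/.gitignore
git commit -m "Track daily extract with DVC"
dvc push  # uploads the actual data to the GCS remote
```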
3. **Local Analysis with DuckDB**
   - Load Parquet or `.duckdb` files into DuckDB
   - Join, filter, and explore locally without re-querying BigQuery (example session below)
   - Optionally define logic in dbt-duckdb if models become complex
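For example, a quick local session over the DVC-tracked snapshots might look like this (paths and column names are assumptions carried over from the sketch above):

```python
# Minimal sketch: local analysis without touching BigQuery.
import duckdb

con = duckdb.connect("data/analytics.duckdb")  # placeholder database file

# Query the daily Parquet snapshots directly
daily_counts = con.sql("""
    SELECT event_date, COUNT(*) AS events
    FROM read_parquet('data/events_*.parquet')
    GROUP BY event_date
    ORDER BY event_date
""").df()
print(daily_counts.head())
```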
4. **Reporting with Evidence**
   - Connect Evidence to the `.duckdb` file or query `read_parquet()` directly
   - Build SQL + Markdown reports with variables and filters (page sketch below)
   - Version all reports and data models in Git
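As a sketch of what an Evidence page could look like, reusing the placeholder files from above (the page, query, and column names are assumptions; the exact source setup depends on your Evidence configuration):

````markdown
<!-- pages/daily_events.md -- hypothetical page -->

```sql daily_events
select event_date, count(*) as events
from read_parquet('data/events_*.parquet')
group by 1
order by 1
```

<LineChart data={daily_events} x=event_date y=events />
````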
5. **Deployment to GCP (Optional Automation/Serving)**
   - Trigger daily jobs (BQ → Parquet → GCS) via one of (entry-point sketch below):
     - Cloud Functions
     - Cloud Scheduler + Cloud Run / Composer
   - Host Evidence:
     - Deploy to Cloud Run (good for lightweight services)
     - Store static reports in Cloud Storage + CDN (if pre-rendered)
     - Use GitHub Actions to regenerate + deploy reports on push
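As a sketch, the scheduled job can be a tiny HTTP-triggered function that Cloud Scheduler invokes once a day; this assumes the `bq_to_parquet.py` script mentioned later exposes a `main()` entry point:

```python
# Minimal sketch of an HTTP entry point for Cloud Functions / Cloud Run,
# suitable as a Cloud Scheduler target.
import functions_framework

@functions_framework.http
def run_daily_extract(request):
    # Assumption: bq_to_parquet.py exposes a main() that runs the extract
    from bq_to_parquet import main
    main()
    return ("ok", 200)
```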
## ✅ Revised Pipeline (with Tech Choices)

### 🔄 Ingestion + Transformation

| Step | Tool | Notes |
|---|---|---|
| Query BigQuery | `google-cloud-bigquery`, SQL | Use parameterized queries, incremental filters (e.g., `WHERE date > @last_day`) |
| Save Locally | `pandas.to_parquet()` or `duckdb.register()` | Save to `data/` folder |
| Version Data | DVC | Store metadata in Git, push data to GCS |

### 🔍 Analysis

| Step | Tool | Notes |
|---|---|---|
| Explore / Join / Filter | DuckDB | Fast, local SQL |
| Optional Modeling | dbt-duckdb | Structured SQL transformations, testable |

### 📊 Reporting

| Step | Tool | Notes |
|---|---|---|
| Report Development | Evidence.dev | Markdown + SQL, supports DuckDB |
| Data Source | `.duckdb` file or `read_parquet()` | Load from `data/` directory tracked by DVC |

### ☁️ Deployment

| Step | Tool | Notes |
|---|---|---|
| Automate Daily Jobs | Cloud Scheduler → Cloud Run / Function | Call Python script to re-query and re-save |
| Upload to GCS | `gsutil`, `dvc push` | Store `.parquet` or `.duckdb` snapshots |
| Deploy Reports | Cloud Run / Static Hosting | Deploy full Evidence app or static HTML exports |
| CI/CD | GitHub Actions | Auto-run pipeline, push reports, update storage |

## 🧩 Optional Enhancements
- Dagster or Prefect instead of scripts for orchestration (nicer logs, retries, visual flow) — sketch below
- Great Expectations or Evidently to validate output before reporting
- GCS Lifecycle Policies to auto-delete old data
- FastAPI + Evidence hybrid for interactive dashboards
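For the orchestration option, a minimal Prefect sketch (task bodies elided; all names are placeholders):

```python
# Minimal sketch: the daily job as a Prefect flow instead of a bare
# script, gaining retries, logging, and a visual run history.
from prefect import flow, task

@task(retries=3, retry_delay_seconds=60)
def extract_from_bigquery():
    ...  # the BigQuery -> Parquet logic from step 1

@task
def push_data():
    ...  # dvc push / upload to GCS

@flow(name="daily-bq-extract")
def daily_pipeline():
    extract_from_bigquery()
    push_data()

if __name__ == "__main__":
    daily_pipeline()
```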
## 🚦 Deployment Diagram (Simplified)

```
[BigQuery]
    ↓ (daily query)
[Python script or Cloud Function]
    ↓
[.parquet or .duckdb files] → [GCS + DVC remote]
    ↓
[DuckDB Analysis] ↔ [Evidence Reports]
    ↓                    ↓
[Push to GitHub]   [Deploy to GCP (Cloud Run / Static HTML)]
```
## 🧰 Would You Like…

- A ready-made project template with:
  - `extract_config.yaml`
  - `bq_to_parquet.py`
  - `evidence/settings.yaml`
  - a `dvc.yaml` stage definition?
- A minimal `cloudbuild.yaml` or GitHub Actions CI to deploy Evidence?
- An example GCS setup with a DVC remote?
Let me know what you'd like next!
```bash
# Evidence local development workflow
npm install
npm run sources
npm run dev
```
### Running Repository Recommendations
To generate repository recommendations based on your starred repos and upload them to your Google Sheet, run the following command:
```bash
python scripts/repo_recommender.py
```
This script will:
- Read your starred repositories from the "starred" sheet in your configured Google Sheet.
- Analyze your preferred topics and languages.
- Search GitHub for similar repositories with at least 1000 stars.
- Score and sort these recommendations (a toy sketch of the idea follows below).
- Write the results to a new sheet named "recommendations" in the same Google Sheet workbook.
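Purely as an illustration of the scoring idea described above (the weights and function below are invented for this sketch and are not the script's actual logic):

```python
# Illustrative sketch only -- NOT the real logic in scripts/repo_recommender.py.
import math

def score_repo(topics, language, preferred_topics, preferred_languages, stars):
    """Toy score: topical fit weighted over log-dampened popularity."""
    topic_overlap = len(set(topics) & set(preferred_topics))
    language_match = 1 if language in preferred_languages else 0
    return 3 * topic_overlap + 2 * language_match + math.log10(max(stars, 1))
```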
pros:
- total version control - everything lives in the repo. no need for BQ views (probably), and the dashboard is plain markdown (vs Tableau, which keeps no history)
- working in GitLab, working in Cursor = huge potential for productivity gains. over time, with the docs fed in, an LLM could crank out practically all the reports by itself
- full freedom in report design - half the documentation describes frontend components. still to be tested, but it looks like you could build anything from dashboards to articles
- SQL heavy - the thing the team is best at. calculated fields will practically write themselves
- cost reduction - we'd drop the …
- very fast - to be confirmed, but the description suggests the engine is very good (DuckDB is definitely strong)
cons:
- a completely new take on BI - certainly unintuitive to a large degree. instead of a UI where you shuffle tiles around, you have to write and stare at code.
- deployment & access - the big unknown. of the listed compatible deployment targets, only Cloudflare Pages seems to match. needs verifying with devops, but it's probably the same gateway we use for e.g. bistro, parking, whatif. even if it worked, wrestling with devops on a daily basis sounds like torture. they'd likely have to grant some kind of admin access, otherwise we're totally screwed