# Veatec22/evidence_test

You're on the right track: what you're describing is a modern, reproducible, cost-aware data analytics pipeline. Let's refine and structure it step by step so it's robust, automatable, and production-ready.

🧠 Your High-Level Plan (Refined)

Here's your concept, revised into a clean, reliable pipeline:

  1. Daily Extraction: BigQuery → Parquet or DuckDB (see the sketch at the end of this step)

    Run daily query jobs to extract data from BigQuery

    Save to compressed Parquet (e.g., gzip, snappy) or .duckdb

    Store locally (for dev) and/or push to:

     Google Cloud Storage (for backup + reproducibility)
    
     Optionally commit small files to GitHub (via DVC or Git LFS)
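
For step 1, here is a minimal sketch of the daily extract, assuming a hypothetical `my_project.analytics.events` table and a hard-coded watermark date (in practice you would read the watermark from the previous run):

```python
# Minimal sketch: daily BigQuery -> Parquet extract.
# `my_project.analytics.events` and the watermark value are placeholders.
from google.cloud import bigquery

client = bigquery.Client()

query = """
    SELECT *
    FROM `my_project.analytics.events`
    WHERE event_date > @last_day
"""
job_config = bigquery.QueryJobConfig(
    query_parameters=[
        bigquery.ScalarQueryParameter("last_day", "DATE", "2024-01-01"),
    ]
)

# Pull only the incremental slice and write it as compressed Parquet.
df = client.query(query, job_config=job_config).to_dataframe()
df.to_parquet("data/events.parquet", compression="snappy")
```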
    
  2. Versioned Storage & Tracking: DVC (see the sketch at the end of this step)

    Use DVC to:

     Track changes to the data files
    
     Push data to GCS remote storage
    
     Link data version to specific code/report commits
    

    Avoid storing large files directly in GitHub
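
For step 2, one way to wire DVC into the daily job is to shell out to its CLI after each extract. A minimal sketch, assuming `dvc init` has already been run and a GCS remote is configured as the default:

```python
# Minimal sketch: track the fresh extract with DVC and push the data
# to the configured remote (e.g., GCS). Paths are placeholders.
import subprocess

def run(cmd: list[str]) -> None:
    subprocess.run(cmd, check=True)

# Track the data file; DVC writes a small .dvc metadata file for Git.
run(["dvc", "add", "data/events.parquet"])

# Upload the data itself to remote storage (not to GitHub).
run(["dvc", "push"])

# Commit only the lightweight metadata, linking this data version
# to the current code/report commit.
run(["git", "add", "data/events.parquet.dvc", "data/.gitignore"])
run(["git", "commit", "-m", "Update daily extract"])
```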

  3. Local Analysis with DuckDB (see the sketch at the end of this step)

    Load Parquet or .duckdb into DuckDB

    Join, filter, explore locally without re-querying BigQuery

    Optionally define logic in dbt-duckdb if models become complex
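
For step 3, local exploration over the tracked files needs only a few lines; a minimal sketch (table and column names are illustrative):

```python
# Minimal sketch: query the DVC-tracked Parquet files locally with
# DuckDB, without re-querying BigQuery.
import duckdb

con = duckdb.connect("local.duckdb")

# DuckDB scans the Parquet files in place; no load step required.
busiest_days = con.execute("""
    SELECT event_date, COUNT(*) AS events
    FROM read_parquet('data/events.parquet')
    GROUP BY event_date
    ORDER BY events DESC
    LIMIT 10
""").fetchdf()

print(busiest_days)
```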

  4. Reporting with Evidence

    Connect Evidence to .duckdb or query read_parquet() directly

    Build SQL+Markdown reports with variables and filters

    Version all reports and data models in Git

  5. Deployment to GCP (Optional Automation/Serving; see the sketch at the end of this step)

    Trigger daily jobs (BQ → Parquet → GCS) via:

     Cloud Functions
    
     Cloud Scheduler + Cloud Run / Composer
    

    Host Evidence:

     Deploy to Cloud Run (good for lightweight services)
    
     Serve static reports from Cloud Storage + a CDN (if pre-rendered)
    
     Use GitHub Actions to regenerate + deploy reports on push
    

✅ Revised Pipeline (with Tech Choices)

🔄 Ingestion + Transformation

| Step | Tool | Notes |
| --- | --- | --- |
| Query BigQuery | google-cloud-bigquery, SQL | Use parameterized queries and incremental filters (e.g., `WHERE date > @last_day`) |
| Save Locally | pandas.to_parquet() or duckdb.register() | Save to the data/ folder |
| Version Data | DVC | Store metadata in Git, push data to GCS |

🔍 Analysis

| Step | Tool | Notes |
| --- | --- | --- |
| Explore / Join / Filter | DuckDB | Fast, local SQL |
| Optional Modeling | dbt-duckdb | Structured, testable SQL transformations |

📊 Reporting

| Step | Tool | Notes |
| --- | --- | --- |
| Report Development | Evidence.dev | Markdown + SQL, supports DuckDB |
| Data Source | .duckdb file or read_parquet() | Load from the data/ directory tracked by DVC |

☁️ Deployment

| Step | Tool | Notes |
| --- | --- | --- |
| Automate Daily Jobs | Cloud Scheduler → Cloud Run / Function | Call a Python script to re-query and re-save |
| Upload to GCS | gsutil, dvc push | Store .parquet or .duckdb snapshots |
| Deploy Reports | Cloud Run / Static Hosting | Deploy the full Evidence app or static HTML exports |
| CI/CD | GitHub Actions | Auto-run the pipeline, push reports, update storage |

🧩 Optional Enhancements

Dagster or Prefect instead of ad-hoc scripts for orchestration (nicer logs, retries, visual flow); see the sketch after this list

Great Expectations or Evidently to validate output before reporting

GCS Lifecycle Policies to auto-delete old data

FastAPI + Evidence hybrid for interactive dashboards
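
As a sketch of the orchestration option, the daily job could look roughly like this as a Prefect flow, reusing the helpers from the earlier sketches (task bodies elided; names are illustrative):

```python
# Minimal sketch: the daily pipeline as a Prefect flow instead of a
# bare script, gaining logging and retries per task.
from prefect import flow, task


@task(retries=2, retry_delay_seconds=60)
def extract_from_bigquery() -> str:
    # Re-query BigQuery and write the Parquet snapshot (step 1 sketch).
    return "data/events.parquet"


@task
def version_with_dvc(path: str) -> None:
    # dvc add / dvc push, as in the step 2 sketch.
    ...


@flow
def daily_pipeline():
    path = extract_from_bigquery()
    version_with_dvc(path)


if __name__ == "__main__":
    daily_pipeline()
```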

🚦 Deployment Diagram (Simplified)

```
[BigQuery]
    ↓ (daily query)
[Python script or Cloud Function]
    ↓
[.parquet or .duckdb files] → [GCS + DVC remote]
    ↓
[DuckDB Analysis] ↔ [Evidence Reports]
    ↓                       ↓
[Push to GitHub]    [Deploy to GCP (Cloud Run / Static HTML)]
```

🧰 Would You Like…

A ready-made project template with:

    extract_config.yaml

    bq_to_parquet.py

    evidence/settings.yaml

    dvc.yaml stage definition?

A minimal cloudbuild.yaml or GitHub Actions CI to deploy Evidence?

Example GCS setup with DVC remote?

Let me know what you'd like next!

To run the Evidence app locally:

```bash
npm install
npm run sources
npm run dev
```


### Running Repository Recommendations

To generate repository recommendations based on your starred repos and upload them to your Google Sheet, run the following command:

```bash
python scripts/repo_recommender.py
```

This script will do the following (a sketch of the search-and-score step appears after the list):

  1. Read your starred repositories from the "starred" sheet in your configured Google Sheet.
  2. Analyze your preferred topics and languages.
  3. Search GitHub for similar repositories with at least 1000 stars.
  4. Score and sort these recommendations.
  5. Write the results to a new sheet named "recommendations" in the same Google Sheet workbook.
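
For steps 3-4, the search-and-score idea can be illustrated against the GitHub search API. A minimal sketch with an illustrative scoring formula (the actual logic in scripts/repo_recommender.py may differ):

```python
# Minimal sketch: search GitHub for repos (>= 1000 stars) matching the
# user's preferred topics and rank them. The score below is illustrative,
# not the formula scripts/repo_recommender.py uses.
import requests

def search_repos(topic: str, min_stars: int = 1000) -> list[dict]:
    resp = requests.get(
        "https://api.github.com/search/repositories",
        params={"q": f"topic:{topic} stars:>={min_stars}", "sort": "stars"},
        headers={"Accept": "application/vnd.github+json"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["items"]

def score(repo: dict, preferred_topics: set[str]) -> float:
    # Reward star count, boosted by overlap with preferred topics.
    overlap = len(preferred_topics & set(repo.get("topics", [])))
    return repo["stargazers_count"] * (1 + overlap)

preferred = {"duckdb", "bi"}  # placeholder topics
candidates = {r["full_name"]: r for t in preferred for r in search_repos(t)}
ranked = sorted(candidates.values(), key=lambda r: score(r, preferred), reverse=True)
for repo in ranked[:10]:
    print(repo["full_name"], repo["stargazers_count"])
```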

Pros:

  • Total version control: everything lives in the repo. Probably no need for views in BQ, and the dashboard is plain Markdown (vs. Tableau, which keeps no history).
  • Working in GitLab / working in Cursor = huge potential productivity gains. Over time, given the documentation, an LLM could churn out practically all the reports by itself.
  • Full freedom over report appearance: half the documentation describes frontend components. Still to be tested, but it looks like you could build anything from dashboards to articles.
  • SQL-heavy: exactly what the team is best at. Calculated fields will practically write themselves.
  • Cost reduction: no more …
  • Very fast: to be confirmed, but the description suggests the engine is very good (DuckDB is certainly strong).

Cons:

  • A completely new way of looking at BI: bound to be unintuitive to a large degree. Instead of a UI where you shuffle tiles around, you have to write and read code.
  • Deployment & access: a big unknown. Of the listed compatible deployment targets, probably only Cloudflare Pages fits. To be verified with the DevOps team, but this is presumably the same gateway we use for e.g. bistro, parking, and whatif. Even if it worked, wrestling with DevOps on a daily basis sounds like torture; they would probably have to grant some kind of admin access, otherwise it's a complete non-starter.
