Release Factor House Local v2.0 · factorhouse/factorhouse-local · GitHub

Factor House Local v2.0

github-actions released this 09 Jul 01:11
bb3f2d7

Factor House Local v2.0: A Unified Platform with Enhanced Persistence

We are thrilled to announce a significant update to Factor House Local, our suite of pre-configured Docker Compose environments. This release features a major architectural enhancement, merging our previously separate streaming and batch analytics environments into a single, cohesive platform. We also add Apache Hive Metastore to provide a robust, centralized, and persistent catalog for the entire data ecosystem.

Release Highlights

🚀 Consolidation into a Unified Analytics Platform

We have merged the previously separate Flink/Flex and Spark/Analytics environments into a single Unified Analytics Platform. This architecture bridges the gap between real-time stream processing (Apache Flink) and large-scale batch processing (Apache Spark), allowing both engines to operate seamlessly on a shared Apache Iceberg data lakehouse.

  • Benefit: This consolidation provides a more powerful and streamlined development experience. By merging the two stacks, users gain:
    • Simplified Management: A single, integrated platform reduces the complexity and operational overhead of managing, configuring, and running separate environments.
    • A Single Source of Truth: The unified architecture eliminates data silos, allowing both streaming and batch jobs to work on the same data. This enables everything from low-latency event streaming to complex historical analysis on a consistent dataset.
    • Faster, More Realistic Prototyping: Developers can rapidly build and test end-to-end pipelines that more accurately model modern, production-grade data platforms.

🧠 Hive Metastore: The New Unified Catalog

The platform now utilizes Apache Hive Metastore as its central catalog. Backed by a durable PostgreSQL database, the Hive Metastore provides robust, persistent metadata management for the entire ecosystem.

  • Benefit: The Metastore serves as the persistent memory for the analytics platform. Beyond cataloging tables, it stores analytical logic itself: permanent SQL views that encapsulate complex queries and custom user-defined functions (UDFs) that extend native capabilities. Because these definitions persist across sessions, analytical assets defined in one session are available to other Flink and Spark jobs, which improves reusability and simplifies development.
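
As a rough illustration of what a shared catalog means in practice, the sketch below generates the settings each engine would typically need to register an Iceberg catalog backed by the same Hive Metastore. The catalog name `demo`, the `thrift://hive-metastore:9083` URI, and the `s3a://warehouse/` path are assumptions chosen for illustration, not values taken from this release; check the repository's compose files for the actual ones.

```python
# Sketch: point Spark and Flink at one Hive Metastore via Iceberg catalogs.
# Catalog name, metastore URI, and warehouse path are hypothetical.

def spark_catalog_conf(name: str, uri: str, warehouse: str) -> dict:
    """Spark properties that register an Iceberg catalog of type 'hive'."""
    prefix = f"spark.sql.catalog.{name}"
    return {
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.type": "hive",
        f"{prefix}.uri": uri,
        f"{prefix}.warehouse": warehouse,
    }


def flink_catalog_ddl(name: str, uri: str, warehouse: str) -> str:
    """Flink SQL statement that registers the same catalog for Flink jobs."""
    return (
        f"CREATE CATALOG {name} WITH (\n"
        "  'type' = 'iceberg',\n"
        "  'catalog-type' = 'hive',\n"
        f"  'uri' = '{uri}',\n"
        f"  'warehouse' = '{warehouse}'\n"
        ")"
    )


if __name__ == "__main__":
    args = ("demo", "thrift://hive-metastore:9083", "s3a://warehouse/")
    for key, value in spark_catalog_conf(*args).items():
        print(f"{key}={value}")
    print(flink_catalog_ddl(*args))
```

Because both engines resolve table names through the same metastore URI, a table created from one engine under this catalog should be visible to the other without copying metadata.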

💾 Enhanced Flink Reliability with Persistent State

The Flink configuration has been upgraded to ensure better resilience and easier management of streaming jobs. Flink checkpoints and savepoints are now configured to persist directly to MinIO (S3-compatible object storage).

  • Benefit: This ensures robust state management and reliable recovery for Flink applications. Jobs can recover cleanly from failures, and operational management of long-running streaming processes is significantly simplified.
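
To make the persistence setup more concrete, here is a minimal sketch that assembles the kind of Flink configuration entries usually involved in writing checkpoints and savepoints to an S3-compatible store. The endpoint `http://minio:9000` and the bucket name `flink-state` are hypothetical, credentials are deliberately left out, and the exact keys may differ by Flink version (newer releases also accept `execution.checkpointing.*` equivalents), so treat this as an approximation rather than the configuration shipped in this release.

```python
# Sketch: Flink settings for persisting state to MinIO (S3-compatible).
# Endpoint and bucket are hypothetical; credentials are intentionally omitted.

def flink_state_conf(endpoint: str, bucket: str) -> dict:
    """Return config entries that direct checkpoints and savepoints to S3."""
    return {
        "state.checkpoints.dir": f"s3://{bucket}/checkpoints",
        "state.savepoints.dir": f"s3://{bucket}/savepoints",
        "s3.endpoint": endpoint,
        # MinIO is normally addressed path-style rather than by virtual host.
        "s3.path.style.access": "true",
    }


def render(conf: dict) -> str:
    """Render the entries as 'key: value' lines, one per setting."""
    return "\n".join(f"{key}: {value}" for key, value in conf.items())


if __name__ == "__main__":
    print(render(flink_state_conf("http://minio:9000", "flink-state")))
```

With state stored outside the containers, a job restarted after a failure or a stack restart can resume from its latest checkpoint or from a named savepoint.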

🌊 CDC-Ready Transactional Hub

The PostgreSQL instance serves a dual role: backing the Hive Metastore and acting as a transactional database ready for Change Data Capture (CDC). It is pre-configured with wal_level=logical, enabling real-time streaming of database changes directly into the lakehouse.

  • Benefit: Users can easily prototype near-real-time synchronization between operational databases and the analytics lakehouse.
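
Before wiring up a CDC connector, it can help to confirm that the database really reports `wal_level=logical`. The sketch below does this through any Python DB-API cursor (for example one from psycopg2, which is an assumption here and not something this release prescribes). The `StubCursor` class is a hypothetical stand-in so the example runs without a database.

```python
# Sketch: check that PostgreSQL is configured for logical decoding (CDC).
# Works with any DB-API cursor; StubCursor is a hypothetical stand-in.

def is_cdc_ready(cursor) -> bool:
    """Return True when the server reports wal_level = 'logical'."""
    cursor.execute("SHOW wal_level")
    row = cursor.fetchone()
    return row is not None and row[0] == "logical"


class StubCursor:
    """Minimal fake cursor that answers SHOW wal_level with a fixed value."""

    def __init__(self, wal_level: str):
        self._wal_level = wal_level
        self._result = None

    def execute(self, sql: str) -> None:
        # Only the one query used above is recognized.
        self._result = (self._wal_level,) if sql == "SHOW wal_level" else None

    def fetchone(self):
        return self._result


if __name__ == "__main__":
    print(is_cdc_ready(StubCursor("logical")))
    print(is_cdc_ready(StubCursor("replica")))
```

Against a real connection, the same function would be called with a live cursor in place of the stub; a `False` result means logical replication slots cannot be created until the setting is changed and the server restarted.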

Core Environments

This release includes the following updated and refined local development stacks:

  • Kafka Development & Monitoring with Kpow: A robust, 3-node Apache Kafka environment including Schema Registry, Kafka Connect, and the Kpow UI/API for enterprise-grade observability and management.
  • Unified Analytics Platform with Flex, Flink, Spark, Iceberg & HMS: A comprehensive lakehouse environment featuring Flink, Spark, Iceberg, Hive Metastore, PostgreSQL (CDC-ready), and MinIO (S3), managed with the Flex UI.
  • Apache Pinot Real-Time OLAP Cluster: A real-time distributed OLAP datastore designed for ultra-low-latency, user-facing analytics and dashboards.