Stars
Lakekeeper is an Apache-licensed, secure, fast, and easy-to-use Apache Iceberg REST Catalog written in Rust.
An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive, as well as APIs
data load tool (dlt) is an open source Python library that makes data loading easy
A flexible distributed key-value database that is optimized for caching and other realtime workloads.
"rsync for cloud storage" - Google Drive, S3, Dropbox, Backblaze B2, One Drive, Swift, Hubic, Wasabi, Google Cloud Storage, Azure Blob, Azure Files, Yandex Files
The easy-to-use open source Business Intelligence and Embedded Analytics tool that lets everyone work with data
Nessie: Transactional Catalog for Data Lakes with Git-like semantics
Download and generate EPUB of your favorite books from O'Reilly Learning (aka Safari Books Online) library.
The Data Contract Specification Repository
Pulumi - Infrastructure as Code in any programming language π
Build context-aware reasoning applications
Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
Stream processing and management platform.
Apache Camel is an open source integration framework that empowers you to quickly and easily integrate various systems consuming or producing data.
The dbt-native data observability solution for data & analytics engineers. Monitor your data pipelines in minutes. Available as self-hosted or cloud service with premium features.
Kubernetes CLI To Manage Your Clusters In Style!
ZenML: The bridge between ML and Ops. https://zenml.io.
Scripts and samples to support Confluent Demos, Talks, and Blogs. Not all of the examples in this repository are kept up to date. For automated tutorials and QA'd code, see https://github.com/confl…
Hopsworks - Data-Intensive AI platform with a Feature Store
Open Source Feature Flagging and A/B Testing Platform
The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.
The fastest way to build data pipelines. Develop iteratively, deploy anywhere.
Always know what to expect from your data.
Modin: Scale your Pandas workflows by changing a single line of code
Example Jupyter notebooks that demonstrate how to build, train, and deploy machine learning models using Amazon SageMaker.