ETL

Comparison of technologies for performing ETL. The idea is simple: read a dataset from Postgres, replicate it in another Postgres database, and evaluate time and memory consumption. Detailed statistics can be found in results.ipynb.

Data

OS Open UPRN: https://osdatahub.os.uk/downloads/open/OpenUPRN

Full count: 41,011,955 rows. Test count: 2,000,000 rows.

Highlights

pg_dump/pg_restore

  • The most efficient tool for PostgreSQL-to-PostgreSQL replication; a minimal sketch follows this list.
  • Does not allow transforming the data in transit.
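
For Postgres-to-Postgres copies the two tools can be chained without an intermediate file. A minimal sketch driven from Python, assuming a table named public.open_uprn and local connections (neither is specified in this repo):

```python
import subprocess

# Assumed table name for the OS Open UPRN data.
TABLE = "public.open_uprn"

# Dump a single table from the origin database in custom format...
dump = subprocess.Popen(
    ["pg_dump", "--dbname=postgres", f"--table={TABLE}", "--format=custom"],
    stdout=subprocess.PIPE,
)

# ...and stream it straight into the target database.
subprocess.run(
    ["pg_restore", "--dbname=target", "--no-owner"],
    stdin=dump.stdout,
    check=True,
)
dump.stdout.close()
dump.wait()
```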

Sling

  • Great for replication, as it includes many built-in features (retries, streaming, etc.); see the sketch after this list.
  • Very low memory footprint.
  • Not as fast as the other solutions.
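
A hedged sketch of the same replication through the Sling CLI, invoked from Python. The connection names (ORIGIN, TARGET), URLs, and table name are assumptions; Sling resolves connections from environment variables:

```python
import os
import subprocess

# Placeholder connection strings, exposed to Sling as environment variables.
env = {
    **os.environ,
    "ORIGIN": "postgresql://user:pass@localhost:5432/postgres",
    "TARGET": "postgresql://user:pass@localhost:5432/target",
}

# Replicate one table from ORIGIN to TARGET.
subprocess.run(
    [
        "sling", "run",
        "--src-conn", "ORIGIN",
        "--src-stream", "public.open_uprn",
        "--tgt-conn", "TARGET",
        "--tgt-object", "public.open_uprn",
    ],
    env=env,
    check=True,
)
```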

DuckDB

  • The winner in execution time for both small and large datasets; a sketch follows this list.
  • Not distributed, so it may struggle with very large datasets.
  • Mostly SQL-based: familiar to many, but potentially limiting.
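
A minimal sketch using DuckDB's postgres extension to attach both databases and copy the table in one SQL statement (connection strings and the table name are assumptions):

```python
import duckdb

con = duckdb.connect()  # in-memory DuckDB instance
con.execute("INSTALL postgres")
con.execute("LOAD postgres")

# Attach both Postgres databases; connection strings are placeholders.
con.execute("ATTACH 'dbname=postgres host=localhost' AS origin (TYPE postgres)")
con.execute("ATTACH 'dbname=target host=localhost' AS target (TYPE postgres)")

# Copy the table with a single CREATE TABLE ... AS SELECT.
con.execute(
    "CREATE TABLE target.public.open_uprn AS "
    "SELECT * FROM origin.public.open_uprn"
)
```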

Spark

  • Manages memory well for both small and large datasets.
  • Not as fast as DuckDB in these tests; a sketch follows this list.
  • Distributed, so it can handle very large datasets (terabytes and more).
  • Supports SQL, Python, and Scala.
  • Also offers machine learning and graph processing capabilities.
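
A hedged PySpark sketch using the JDBC data source; the JDBC URLs, credentials, driver version, and table name are assumptions:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("etl-comparison")
    # Pull in the Postgres JDBC driver (version is a placeholder).
    .config("spark.jars.packages", "org.postgresql:postgresql:42.7.3")
    .getOrCreate()
)

# Read the table from the origin database over JDBC...
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://localhost:5432/postgres")
    .option("dbtable", "public.open_uprn")
    .option("user", "postgres")
    .load()
)

# ...and write it to the target database.
(
    df.write.format("jdbc")
    .option("url", "jdbc:postgresql://localhost:5432/target")
    .option("dbtable", "public.open_uprn")
    .option("user", "postgres")
    .mode("overwrite")
    .save()
)
```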

Polars

  • Very efficient compared to Pandas, and for small datasets it competes well with Spark; a sketch follows this list.
  • API very similar to Pandas.
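
A hedged Polars sketch; the connection URIs and table name are assumptions. Note that read_database_uri relies on the connectorx package, and write_database defaults to a SQLAlchemy engine:

```python
import polars as pl

# Placeholder connection URIs.
ORIGIN = "postgresql://user:pass@localhost:5432/postgres"
TARGET = "postgresql://user:pass@localhost:5432/target"

# Read the whole table into memory...
df = pl.read_database_uri("SELECT * FROM public.open_uprn", ORIGIN)

# ...and write it to the target database.
df.write_database("public.open_uprn", TARGET, if_table_exists="replace")
```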

Setup

Initial data upload

Run the upload:

cd data
sh initial_upload.sh

Create databases

origin: postgres
target: target

Create the target:

createdb target