Experiment Glue data staging #60

victorskl · 2025-03-11T03:57:55Z

This task is an experiment to build an understanding of warehouse data staging via Glue.

Items:

Please follow the Glue local development setup - https://github.com/umccr/orcahouse/tree/main/infra/glue
- The goal is to understand the basics of the PySpark API and how it processes data: the Spark DataFrame, its data parallelism concept, etc.
As an exercise, see whether you can reuse the skel template, say, to process and stage another data source for the warehouse.
Observe the Terraform in the deploy directory and how it deploys the Glue ETL scripts to the prod account (try invoking a job run, etc.; no worries, the ETL scripts are idempotent).
Observe the existing Glue script for spreadsheet processing.
- In the current Glue ETL script, prior to loading into the warehouse staging database, investigate whether we can also "datalake" the output data (processed, or unprocessed/source as-is, etc.), e.g. for historical or archival purposes.
- Perhaps, better yet, an S3 table bucket with Iceberg over the datalake?
- Discuss the pros/cons, and use it when it fits the use case, etc.
Choose Python or Scala for the activity
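
For context on the "ETL scripts are idempotent" note above, here is a minimal sketch of one common way to make a staging load safe to re-run: truncate and reload inside a single transaction. This is an illustration only, not necessarily how the orcahouse Glue scripts do it. It uses the standard-library `sqlite3` module as a stand-in for the warehouse staging database, and the table `staging_sample`, its columns, and the helper `stage_rows` are all hypothetical names.

```python
import sqlite3


def stage_rows(conn, rows):
    """Replace the staging table contents with `rows` in one transaction."""
    # The connection context manager commits on success and rolls back on error,
    # so a failed run never leaves the table half-loaded.
    with conn:
        conn.execute("DELETE FROM staging_sample")
        conn.executemany(
            "INSERT INTO staging_sample (sample_id, phenotype) VALUES (?, ?)",
            rows,
        )


# In-memory database as a stand-in for the staging database (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE staging_sample (sample_id TEXT PRIMARY KEY, phenotype TEXT)"
)

rows = [("S1", "tumor"), ("S2", "normal")]
stage_rows(conn, rows)
stage_rows(conn, rows)  # re-running leaves the table in the same state

count = conn.execute("SELECT COUNT(*) FROM staging_sample").fetchone()[0]
print(count)
```

Because each run replaces the whole table, invoking the job twice with the same source gives the same result as invoking it once, which is what makes trying a job run low-risk.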


victorskl mentioned this issue Mar 13, 2025

Implement data staging from DynamoDB #63

Open
