Experiment Glue data staging · Issue #60 · umccr/orcahouse · GitHub

Experiment Glue data staging #60

Open
victorskl opened this issue Mar 11, 2025 · 0 comments

Comments

@victorskl
Member
victorskl commented Mar 11, 2025

This task is an experiment to gain an understanding of warehouse data staging via Glue.

Items:

  • Please follow the Glue local development setup - https://github.com/umccr/orcahouse/tree/main/infra/glue
    • The goal is to understand the basics of the PySpark API and how it processes data: Spark DataFrames, their data parallelism concept, etc.
  • Please try reusing the skel template, for example to process and stage another data source for the warehouse
  • Review the Terraform in the deploy directory and how it deploys the Glue ETL scripts to the prod account (try invoking a job run, etc.; no worries, the ETL scripts are idempotent)
  • Review the existing Glue script for spreadsheet processing
    • In the current Glue ETL script, before loading into the warehouse staging database, investigate whether we can also write the output data to a "datalake" (processed, unprocessed/source as-is, etc.) for historical or archival purposes
    • Perhaps, better yet, an S3 Table bucket with Iceberg over the datalake?
    • Discuss the pros and cons, and use it where it fits the use case
  • Choose Python or Scala for the activity
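
To make the "datalake the output before staging" idea concrete, here is a minimal sketch of the flow, not the actual Glue script. It uses only the Python standard library so it runs anywhere: a local directory stands in for the S3 datalake bucket, `sqlite3` stands in for the warehouse staging database, and CSV stands in for whatever file format is chosen. All names (`datalake_path`, `archive_rows`, `stage_rows`, the `layer`/`source`/`dt` layout) are hypothetical and not taken from the orcahouse repo.

```python
import csv
import sqlite3
from datetime import date
from pathlib import Path


def datalake_path(root: Path, source: str, layer: str, run_date: date) -> Path:
    """Build a Hive-style partitioned location: <root>/<layer>/source=<s>/dt=<date>."""
    return root / layer / f"source={source}" / f"dt={run_date.isoformat()}"


def archive_rows(rows, columns, root, source, layer, run_date) -> Path:
    """Write rows to the datalake location. Overwriting the same file
    on a rerun keeps the archive step idempotent for a given run date."""
    out_dir = datalake_path(root, source, layer, run_date)
    out_dir.mkdir(parents=True, exist_ok=True)
    out_file = out_dir / "part-0000.csv"
    with out_file.open("w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(columns)
        writer.writerows(rows)
    return out_file


def stage_rows(conn, table, columns, rows) -> int:
    """Truncate-and-load into the staging table (sqlite3 as a stand-in),
    so running the job twice leaves the table in the same state."""
    col_defs = ", ".join(f'"{c}" TEXT' for c in columns)
    placeholders = ", ".join("?" for _ in columns)
    with conn:  # commit on success, roll back on error
        conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({col_defs})')
        conn.execute(f'DELETE FROM "{table}"')
        conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', rows)
    return conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
```

In a real Glue job the archive step would more likely be a Spark DataFrame write to an S3 prefix (or an Iceberg table) than a CSV file, but the ordering is the same: archive first, then truncate-and-load the staging table, with both steps safe to rerun.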