This task is intended as experimentation, to build an understanding of the warehouse data staging activity via Glue.

Items:

- Understand the basics of the PySpark API: how it processes data, the Spark DataFrame, its data-parallelism model, etc. (a minimal sketch follows this list).
- As an exercise, see whether you can reuse the skel template, say, if you were to process and stage another data source for the warehouse (a hypothetical skeleton is sketched below).
- Review the Terraform in the deploy directory and how it deploys the Glue ETL scripts to the prod account (try invoking a job run, etc., as in the boto3 sketch below; no worries, the ETL scripts are idempotent).
- Review the existing Glue script for spreadsheet processing.
- In the current Glue ETL script, before loading into the warehouse staging database, investigate whether we can also "datalake" the output data (processed, unprocessed/source as-is, etc.) for historical and archival purposes (see the datalake sketch below).
- Perhaps, better yet, an S3 Table bucket with Iceberg on top of the datalake? (A configuration sketch follows the other examples below.)
- Discuss the pros and cons, and adopt whichever fits the use case.
- Choose Python or Scala for the activity.
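To ground the first item, here is a minimal, self-contained PySpark sketch (not taken from this repo) showing a DataFrame, lazy transformations, and how the data is split into partitions that Spark processes in parallel:

```python
# Minimal PySpark example: DataFrame creation, lazy transformations,
# and partition-based parallelism. Data and column names are made up.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("pyspark-basics").getOrCreate()

df = spark.createDataFrame(
    [("2024-01-01", "orders", 120), ("2024-01-02", "orders", 95)],
    schema="ingest_date string, source string, row_count int",
)

# Transformations are lazy; each partition becomes a separate task
# only when an action (count, show, write) finally runs.
print(df.rdd.getNumPartitions())
df.groupBy("source").agg(F.sum("row_count").alias("total_rows")).show()

spark.stop()
```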
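For the skel-template item, the template's actual contents aren't reproduced here, so the following is only a hypothetical shape of a Glue staging job; `SOURCE_PATH` and `STAGING_PATH` are placeholder arguments and the real template may differ:

```python
# Hypothetical "skel"-style Glue job shape -- verify against the actual
# template in the repo. Argument names here are placeholders.
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME", "SOURCE_PATH", "STAGING_PATH"])

glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the raw source, apply light cleanup, then stage it for the warehouse.
df = glue_context.spark_session.read.option("header", "true").csv(args["SOURCE_PATH"])
df.write.mode("overwrite").parquet(args["STAGING_PATH"])

job.commit()
```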
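For invoking a run of a deployed job without going through Terraform, a boto3 sketch is below; the job name and region are placeholders, so look up the real values in the deploy directory first. Re-running is safe since the ETL scripts are idempotent:

```python
# Trigger and poll a Glue job run. "example-staging-job" and the region
# are placeholders -- take the real values from the Terraform config.
import boto3

glue = boto3.client("glue", region_name="us-east-1")

response = glue.start_job_run(JobName="example-staging-job")
run_id = response["JobRunId"]

# Check on the run; it moves through RUNNING to SUCCEEDED/FAILED.
status = glue.get_job_run(JobName="example-staging-job", RunId=run_id)
print(status["JobRun"]["JobRunState"])
```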
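For the "datalake the output" item, one possible sketch: persist both the source-as-is and the processed output to partitioned Parquet prefixes before the staging load. All paths, the partition column, and the stand-in transform are assumptions, not the current script's behavior:

```python
# Sketch: archive both raw and processed data to a datalake prefix,
# partitioned by ingest date, before the warehouse staging load.
# Bucket names and paths below are placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

raw_df = spark.read.option("header", "true").csv("s3://example-bucket/incoming/")

# Keep the unprocessed source as-is for history/archival.
(raw_df.withColumn("ingest_date", F.current_date())
       .write.mode("append")
       .partitionBy("ingest_date")
       .parquet("s3://example-bucket/datalake/raw/"))

processed_df = raw_df.dropDuplicates()  # stand-in for the real transforms

# Also archive the processed output before it goes to warehouse staging.
(processed_df.withColumn("ingest_date", F.current_date())
             .write.mode("append")
             .partitionBy("ingest_date")
             .parquet("s3://example-bucket/datalake/processed/"))
```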
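For the S3 Tables + Iceberg idea, a configuration sketch is below. It assumes the Iceberg Spark runtime and the AWS S3 Tables catalog jars are on the job's classpath; the catalog class name follows AWS's published S3 Tables Iceberg catalog, and the bucket ARN, namespace, and table name are placeholders:

```python
# Sketch of writing to an S3 Tables bucket through its Iceberg catalog.
# Assumes the Iceberg runtime + S3 Tables catalog jars are available;
# the ARN, namespace, and table name are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.sql.catalog.s3tables", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.s3tables.catalog-impl",
            "software.amazon.s3tables.iceberg.S3TablesCatalog")
    .config("spark.sql.catalog.s3tables.warehouse",
            "arn:aws:s3tables:us-east-1:111122223333:bucket/example-table-bucket")
    .getOrCreate()
)

df = spark.read.parquet("s3://example-bucket/datalake/processed/")

# DataFrameWriterV2 write into the Iceberg-managed table.
df.writeTo("s3tables.staging.example_table").createOrReplace()
```

On the pros/cons point: Iceberg would give the lake ACID commits, schema evolution, and time travel, which speaks directly to the historical/archival goal; the trade-off is extra catalog configuration and jar/version management compared to plain Parquet prefixes.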