8000 Add metadata columns to the RDBMS contrib by thisiscab · Pull Request #2440 · spotify/luigi · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

Add metadata columns to the RDBMS contrib #2440

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

thisiscab
Copy link
Contributor
@thisiscab thisiscab commented Jun 7, 2018

The goal of this feature is to allow metadata column to exists for
specific tables created for Redshift.

Given the scenario where we would always have to have a created_tz
column at the end of every table generated by that contrib we could do
the following:

class UpdateTableTask(redshift.S3CopyToTable):

    def metadata_columns(self):
        return [('created_tz', 'TIMESTAMP')]

    def metadata_queries(self):
        query =  'UPDATE {0} ' \ 
                 'SET created_tz = CURRENT_TIMESTAMP ' \
                 'WHERE created_tz IS NULL'.format(self.table)
        return [query]

Adding a layer of abstraction over this feature, you could easily
add many default behavior for specific tables for versioning the table
and more.

This feature is opt-in by default since we don't want this break other
people's pipeline after integrating this.

Do we think that this would be a feature that you guys would be interested in merging? I'm unsure if this becomes too specific for our use-case, but it should help some other people to have that type of behavior if need be.

If that's something that you guys are interested in merging, then I'll spend some time in writing a proper testing suite!

Let me know!

Copy link
Collaborator
@dlstadther dlstadther left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a cool addition; however, i'm not completely sold on it yet. It really feels like a specific wrapper around post_copy, but it's executed along side it.

The end goal you're trying to achieve is to easily modify a set series of metadata columns after performing a Redshift Copy. This feels like a finer-grained version of post_copy.

By no means am I rejecting this feature. Let's continue this chat to determine if this is something we can better fit into a streamlined process, without introducing a lot of complexity.

Thanks for submitting!

@@ -37,6 +37,66 @@
"Will crash at runtime if postgres functionality is used.")


class _MetadataColumnsMixin(object):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't distinct to Redshift as i believe this could also function in Postgres. Possibly move this mixin to rdbms.py?

@@ -321,6 +381,9 @@ def create_table(self, connection):
raise ValueError("create_table() found no columns for %r"
% self.table)

if self.do_enable_metadata_columns:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rename to enable_metadata_columns

@@ -338,6 +401,9 @@ def run(self):
self.copy(cursor, path)
self.post_copy(cursor)

if self.do_enable_metadata_columns:
self.post_copy_metacolumns(cursor)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit confused on how this would interact with post_copy. Seems like you would never run both together.

Copy link
Contributor Author
@thisiscab thisiscab Jun 7, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've highlighted an example at the PR level displaying a use-case where we have both cases that run.

@thisiscab
Copy link
Contributor Author
thisiscab commented Jun 7, 2018

This seems like a cool addition; however, i'm not completely sold on it yet. It really feels like a specific wrapper around post_copy, but it's executed along side it.

You would be right to say that it's a wrapper around post_copy because in our original implementation, we were using post_copy instead of post_copy_metacolumns.

I would agree that they are pretty much the same but we use post_copy as task-specifics post copy queries. For instance, we happen to use post_copy to deduplicate possible duplicated output. or very task-specific behaviors.

Alongside this, we also want to be able to run generic updates. For instance, created_tz timestamp. It made sense to decouple them, in our mind, to differentiate task-specific post_copy and generic updates to table with post_copy_metacolumns.

I've created a gist that may better demonstrate how we use it: https://gist.github.com/cabouffard/73cfc798bef9ec09fc91807b99a6c5a9

That said, having this separated in two different methods, allow us to not have to override the content of post_copy if we want to add generic metadata_columns behavior.

The end goal you're trying to achieve is to easily modify a set series of metadata columns after performing a Redshift Copy. This feels like a finer-grained version of post_copy.

Totally agreed with this statement!

@thisiscab
Copy link
Contributor Author

I feel like I haven't sold on your on this one, is that the case @dlstadther ? :)

@dlstadther
Copy link
Collaborator

Hey @cabouffard

After thinking on this all morning, I think my hesitations here are more-so S3CopyToTable structural issues than feature addition problems.

So, let's continue with getting your metadata feature pushed through. I believe I would like like to see the Mixin moved to rdbms (assuming it's also applicable to Postgres (which i believe is the case)).

Can you also mesh some of your gist together in a docstring for the Mixin? I believe another hesitation I had (prior to your gist) was lack of implementation clarity.

I very much appreciate all your recent PRs! Sorry for the delay addressing them to completion!

@thisiscab thisiscab force-pushed the feature/add-metadata-columns-to-redshift branch 5 times, most recently from 8394978 to 22d4912 Compare July 10, 2018 14:45
@thisiscab
Copy link
Contributor Author

@dlstadther No worries mate!

I've added a whole bunch of text, some extra bit of documentation and changed some code-part with the suggestions you've provided.

Is there anything else that you would like for me to do? I'm happy to hear any suggestions!
Thanks! :)

@thisiscab thisiscab changed the title Add metadata columns to Redshift tables Add metadata columns to the RDBMS contrib Jul 10, 2018
@thisiscab
Copy link
Contributor Author

@dlstadther Hey, what's the latest on this one? :)

@dlstadther
Copy link
Collaborator

Sorry for my delayed response here @cabouffard . I pretty sure i'm all good here, but let me review real quick once more this afternoon :)

Copy link
Collaborator
@dlstadther dlstadther left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found a few grammatical issues, a syntax improvement, and one core question.

Other than that, I'm good with this. Particularly, since the default behavior of Redshift Copy will not change.

@@ -27,7 +27,125 @@
logger = logging.getLogger('luigi-interface')


class CopyToTable(luigi.task.MixinNaiveBulkComplete, luigi.Task):
class _MetadataColumnsMixin(object):
"""Provide an additinal behavior that adds columns and values to tables
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo in additional


Example:

This is a use example or how this mixin could come handy and how to use
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you trying to say This is an example use-case of how this mixin could come in handy.?

def update_report_execution_date_query(self):
query = 'UPDATE {0} ' \
"SET date_param = DATE '{1}' " \
'WHERE date_param IS NULL'.format(self.table, self.date)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you use the same quote style for these lines? Rather than going back and forth between single and double quote.



class RedshiftCopyCSVToTableFromS3(CommonMetaColumnsBehavior, redshift.S3CopyToTable):
"We have some business override heres that would only add noise to the
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here (not heres)

for column in self.metadata_columns:
if len(column) == 0:
logger.info('Unable to infer column information from column {column} for {table}'.format(column=column, table=self.table))
break
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this raise an exception?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thorned with that decision. Raising an exception would halt the pipeline and other tasks from running while breaking out of the function would simply not add the meta_columns like it would be expected.

Breaking would proceed with the execution of the pipeline but may yield potentially unexpected results, while raising would halt everything from running.

I think that you bring a fair point and I'll implement an exception as this is how it seems that we're managing that type of cases throughout that contrib.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that neither solution is the most ideal. But i'm more comfortable being explicit in what must be done to achieve happy path. Anything straying from that to raise an exception.



# Using Redshift as the test bed since Redshift implements RDBMS. Could have
# opted for PSQL but I'm less familar with that contrib.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps make this comment a docstring to add emphasis. I almost asked why redshift was being used, but then i saw this comment

The goal of this feature is to allow metadata column to exists for
specific tables created by the Redshift contrib related tasks.

Given the scenario where we would always have to have a `created_tz`
column at the end of every tables generated by that contrib we could do
the following:

```python
UpdateTableTask(redshift.S3CopyToTable):

    def metadata_columns(self):
        return [('created_tz', 'TIMESTAMP')]

    def post_queries(self):
        return query = 'UPDATE {0} ' \
                       'SET created_tz = CURRENT_TIMESTAMP ' \
                       'WHERE created_tz IS NULL'.format(self.table)
```

Adding layer of abstraction over this feature, you could easily
add many default behavior for specific tables for versioning the table
and more.

This feature is opt-in by default since we don't want this break other
people's pipeline after integrating this.
As suggested in the code-review, there are multiple other DBs that could
benefit from this change. Currently, only PSQL and Redshift implements
RDMS but other may implement this class and inherits that new behavior.
We've been internally using this feature for Redshift but moving this to
the RDBMS contrib and adding this behavior to PSQL could have unexpected
side effects, this takes care of testing if the feature works correctly
under Redshift and PSQL.
If the count of metadata_columns is 0 and we're expecting to add them to
the table, then we raise an error because that is an invalid flow.

The contributor is required to have metadata_columns values if we want
to add that column to the table.
@thisiscab thisiscab force-pushed the feature/add-metadata-columns-to-redshift branch from 22d4912 to 70d7023 Compare August 2, 2018 17:09
@thisiscab
Copy link
Contributor Author

@dlstadther Hey ya, we're all green and I've added your suggestions / fixed typos! :)

Copy link
Collaborator
@dlstadther dlstadther left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for all your work here!

@dlstadther dlstadther merged commit 0a63cc0 into spotify:master Aug 2, 2018
thisiscab pushed a commit to glossier/luigi that referenced this pull request Aug 3, 2018
* Add metadata column feature to Redshift

The goal of this feature is to allow metadata column to exists for
specific tables created by the Redshift contrib related tasks.

Given the scenario where we would always have to have a `created_tz`
column at the end of every tables generated by that contrib we could do
the following:

```python
UpdateTableTask(redshift.S3CopyToTable):

    def metadata_columns(self):
        return [('created_tz', 'TIMESTAMP')]

    def post_queries(self):
        return query = 'UPDATE {0} ' \
                       'SET created_tz = CURRENT_TIMESTAMP ' \
                       'WHERE created_tz IS NULL'.format(self.table)
```

Adding layer of abstraction over this feature, you could easily
add many default behavior for specific tables for versioning the table
and more.

This feature is opt-in by default since we don't want this break other
people's pipeline after integrating this.

* Move the Metadata Columns implementation to the RDBMS

As suggested in the code-review, there are multiple other DBs that could
benefit from this change. Currently, only PSQL and Redshift implements
RDMS but other may implement this class and inherits that new behavior.

* Add tests for the Metadata behaviors

We've been internally using this feature for Redshift but moving this to
the RDBMS contrib and adding this behavior to PSQL could have unexpected
side effects, this takes care of testing if the feature works correctly
under Redshift and PSQL.

* Add additional documentation on how to use the new mixin

* Raise ValueError on invalid metadata_columns for RDBMS

If the count of metadata_columns is 0 and we're expecting to add them to
the table, then we raise an error because that is an invalid flow.

The contributor is required to have metadata_columns values if we want
to add that column to the table.
thisiscab pushed a commit to glossier/luigi that referenced this pull request Aug 8, 2018
* Add metadata column feature to Redshift

The goal of this feature is to allow metadata column to exists for
specific tables created by the Redshift contrib related tasks.

Given the scenario where we would always have to have a `created_tz`
column at the end of every tables generated by that contrib we could do
the following:

```python
UpdateTableTask(redshift.S3CopyToTable):

    def metadata_columns(self):
        return [('created_tz', 'TIMESTAMP')]

    def post_queries(self):
        return query = 'UPDATE {0} ' \
                       'SET created_tz = CURRENT_TIMESTAMP ' \
                       'WHERE created_tz IS NULL'.format(self.table)
```

Adding layer of abstraction over this feature, you could easily
add many default behavior for specific tables for versioning the table
and more.

This feature is opt-in by default since we don't want this break other
people's pipeline after integrating this.

* Move the Metadata Columns implementation to the RDBMS

As suggested in the code-review, there are multiple other DBs that could
benefit from this change. Currently, only PSQL and Redshift implements
RDMS but other may implement this class and inherits that new behavior.

* Add tests for the Metadata behaviors

We've been internally using this feature for Redshift but moving this to
the RDBMS contrib and adding this behavior to PSQL could have unexpected
side effects, this takes care of testing if the feature works correctly
under Redshift and PSQL.

* Add additional documentation on how to use the new mixin

* Raise ValueError on invalid metadata_columns for RDBMS

If the count of metadata_columns is 0 and we're expecting to add them to
the table, then we raise an error because that is an invalid flow.

The contributor is required to have metadata_columns values if we want
to add that column to the table.
dlstadther added a commit to dlstadther/luigi that referenced this pull request Aug 14, 2018
* upstream-master: (82 commits)
  S3 client refactor (spotify#2482)
  Rename to rpc_log_retries, and make it apply to all the logging involved
  Factor log_exceptions into a configuration parameter
  Fix attribute forwarding for tasks with dynamic dependencies (spotify#2478)
  Add a visiblity level for luigi.Parameters (spotify#2278)
  Add support for multiple requires and inherits arguments (spotify#2475)
  Add metadata columns to the RDBMS contrib (spotify#2440)
  Fix race condition in luigi.lock.acquire_for (spotify#2357) (spotify#2477)
  tests: Use RunOnceTask where possible (spotify#2476)
  Optional TOML configs support (spotify#2457)
  Added default port behaviour for Redshift (spotify#2474)
  Add codeowners file with default and specific example (spotify#2465)
  Add Data Revenue to the `blogged` list (spotify#2472)
  Fix Scheduler.add_task to overwrite accepts_messages attribute. (spotify#2469)
  Use task_id comparison in Task.__eq__. (spotify#2462)
  Add stale config
  Move github templates to .github dir
  Fix transfer config import (spotify#2458)
  Additions to provide support for the Load Sharing Facility (LSF) job scheduler (spotify#2373)
  Version 2.7.6
  ...
dlstadther added a commit to dlstadther/luigi that referenced this pull request Aug 14, 2018
* upstream-master:
  S3 client refactor (spotify#2482)
  Rename to rpc_log_retries, and make it apply to all the logging involved
  Factor log_exceptions into a configuration parameter
  Fix attribute forwarding for tasks with dynamic dependencies (spotify#2478)
  Add a visiblity level for luigi.Parameters (spotify#2278)
  Add support for multiple requires and inherits arguments (spotify#2475)
  Add metadata columns to the RDBMS contrib (spotify#2440)
  Fix race condition in luigi.lock.acquire_for (spotify#2357) (spotify#2477)
  tests: Use RunOnceTask where possible (spotify#2476)
  Optional TOML configs support (spotify#2457)
  Added default port behaviour for Redshift (spotify#2474)
  Add codeowners file with default and specific example (spotify#2465)
  Add Data Revenue to the `blogged` list (spotify#2472)
dlstadther added a commit to dlstadther/luigi that referenced this pull request Aug 16, 2018
* upstream-master:
  Remove long-deprecated scheduler config variable alternatives (spotify#2491)
  Bump tornado milestone version (spotify#2490)
  Update moto to 1.x milestone version (spotify#2471)
  Use passed password when create a redis connection (spotify#2489)
  S3 client refactor (spotify#2482)
  Rename to rpc_log_retries, and make it apply to all the logging involved
  Factor log_exceptions into a configuration parameter
  Fix attribute forwarding for tasks with dynamic dependencies (spotify#2478)
  Add a visiblity level for luigi.Parameters (spotify#2278)
  Add support for multiple requires and inherits arguments (spotify#2475)
  Add metadata columns to the RDBMS contrib (spotify#2440)
  Fix race condition in luigi.lock.acquire_for (spotify#2357) (spotify#2477)
  tests: Use RunOnceTask where possible (spotify#2476)
  Optional TOML configs support (spotify#2457)
  Added default port behaviour for Redshift (spotify#2474)
  Add codeowners file with default and specific example (spotify#2465)
  Add Data Revenue to the `blogged` list (spotify#2472)
  Fix Scheduler.add_task to overwrite accepts_messages attribute. (spotify#2469)
  Use task_id comparison in Task.__eq__. (spotify#2462)
@thisiscab thisiscab deleted the feature/add-metadata-columns-to-redshift branch January 25, 2019 20:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants
0