8000 [ENH] Add Subsequence Extraction Transformer to `sktime.transformations` module by wirrywoo · Pull Request #6967 · sktime/sktime · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

[ENH] Add Subsequence Extraction Transformer to sktime.transformations module #6967

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 14 commits into from
Aug 23, 2024

Conversation

wirrywoo
Copy link
Contributor
@wirrywoo wirrywoo commented Aug 13, 2024

Reference Issues/PRs

Fixes #6901.

What does this implement/fix? Explain your changes.

Subsequence extraction transformer is a transformer created to extract subsequences of specified length that
meet some criterion with respect to an aggregate function. It addresses questions like the following:

  • Given a sequence $(x_1, x_2, ..., x_n)$, find a $k$ - length subsequence such that the average value of $(x_j, x_{j+1}, ..., x_{j+k-1})$ is maximal.

Does your contribution introduce a new dependency? If yes, which one?

Nope.

What should a reviewer concentrate their feedback on?

  • Agreed naming choices for parameters and class variables involved
  • Potential areas of improvements regarding optimizing code performance and code refactoring
  • Ensure transformed results are accurate
  • Descriptive documentation

Did you add any tests for the change?

Nope.

Any other comments?

Excited to add my first PR to sktime. Really looking forward to the reviews!

PR checklist

For all contributions
  • I've added myself to the list of contributors with any new badges I've earned :-)
    How to: add yourself to the all-contributors file in the sktime root directory (not the CONTRIBUTORS.md). Common badges: code - fixing a bug, or adding code logic. doc - writing or improving documentation or docstrings. bug - reporting or diagnosing a bug (get this plus code if you also fixed the bug in the PR).maintenance - CI, test framework, release.
    See here for full badge reference
  • Optionally, for added estimators: I've added myself and possibly to the maintainers tag - do this if you want to become the owner or maintainer of an estimator you added.
    See here for further details on the algorithm maintainer role.
  • The PR title starts with either [ENH], [MNT], [DOC], or [BUG]. [BUG] - bugfix, [MNT] - CI, test framework, [ENH] - adding or improving code, [DOC] - writing or improving documentation or docstrings.
For new estimators
  • I've added the estimator to the API reference - in docs/source/api_reference/taskname.rst, follow the pattern.
  • I've added one or more illustrative usage examples to the docstring, in a pydocstyle compliant Examples section.
  • If the estimator relies on a soft dependency, I've set the python_dependencies tag and ensured
    dependency isolation, see the estimator dependencies guide.

@wirrywoo wirrywoo marked this pull request as ready for review August 13, 2024 22:52
@fkiraly fkiraly added module:transformations transformations module: time series transformation, feature extraction, pre-/post-processing enhancement Adding new functionality labels Aug 14, 2024
Copy link
Collaborator
@fkiraly fkiraly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some questions about the design here.

  • why is this a panel transformer? Do we need information about the other sequences for one given sequence?
  • related: why are we padding at first? Should the algorithm not also work for unequal length series?
  • currently, we are carrying out the aggregation in fit and remembering indices for transform, correct? Should we not instead carry it out in transform, i nthe first place?
  • we can carry out minimal parameter validation in fit, or in __init__, to avoid the "else" and "try-except" blocks later.

Further, code formatting checks are faliing, I assume that precommit has not been fully set up by you - here's how to set them up locally: https://www.sktime.net/en/latest/developer_guide/coding_standards.html#setting-up-local-code-quality-checks

@wirrywoo
Copy link
Contributor Author
wirrywoo commented Aug 14, 2024

Responses to @fkiraly:

  • why is this a panel transformer? Do we need information about the other sequences for one given sequence?

Ahh, I was referencing TruncationTransformer thinking that the functionality would be similar as I'm trying to identify the correct indices to extract the subsequences, but now I realized why TruncationTransformer would be considered a panel transformer while SubsequenceExtractionTransformer wouldn't. Perhaps it should be classified as a Series-to-series transformer under Window-based series transforms instead? I made this change in this commit.

  • related: why are we padding at first? Should the algorithm not also work for unequal length series?

Remo 8000 ved in this commit. Initially I had it to pass some checks, but found an alternative solution without needing to use PaddingTransformer.

  • currently, we are carrying out the aggregation in fit and remembering indices for transform, correct? Should we not instead carry it out in transform, i nthe first place?

Initially, I defined "fitting" as trying to compute the indices needed to retrieve the correct subsequence from the sequence. Then, I reserved the "transform" step to actually retrieve the subsequence from the input data. Do you think this is a good way to design this transformer or can the fitting definition be improved here?

  • we can carry out minimal parameter validation in fit, or in init, to avoid the "else" and "try-except" blocks later.

Refactored in this commit.

Further, code formatting checks are faliing, I assume that precommit has not been fully set up by you - here's how to set them up locally: https://www.sktime.net/en/latest/developer_guide/coding_standards.html#setting-up-local-code-quality-checks

Ahh, I forgot that I reconfigured my WSL instance to use Ubuntu 22.04 and forgot to reinstall the pre-commits in that process. Thanks!

@wirrywoo wirrywoo requested a review from fkiraly August 14, 2024 14:36
@fkiraly
Copy link
Collaborator
fkiraly commented Aug 14, 2024

Do you think this is a good way to design this transformer or can the fitting definition be improved here?

In general, the series seen in transform can be different from the one seen in fit. So, should we not carry out the computation in transform? Not entirely sure though.

Perhaps it should be classified as a Series-to-series transformer under Window-based series transforms instead?

Yes, series-to-series is the correct type in my opinion.

@wirrywoo
Copy link
Contributor Author

In general, the series seen in transform can be different from the one seen in fit. So, should we not carry out the computation in transform? Not entirely sure though.

I see. It does make sense especially if we are using a fitted transformer for inference purposes, so after more thought, I agree with you. Changes are in this commit.

Thanks!

@fkiraly
Copy link
Collaborator
fkiraly commented Aug 15, 2024

great, restarting the tests.

@wirrywoo
Copy link
Contributor Author

Apologies for the failed pre-commit. I believe I resolved it this time. I need to double check if I set my pre-commits correctly on my local machine.

Can you restart the tests when you get a chance? Thanks!

@fkiraly
Copy link
Collaborator
fkiraly commented Aug 15, 2024

(another go)

@wirrywoo
Copy link
Contributor Author
wirrywoo commented Aug 15, 2024

Running into build issue #6772, namely Read the Docs build failed due to timeout. Confirmed that this issue shouldn't block the PR's build. A restart on the build until the docs build step passes should do the trick.

fkiraly
fkiraly previously approved these changes Aug 15, 2024
Copy link
Collaborator
@fkiraly fkiraly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me!

Suggestions for improvement, but not blocking:

  • we could allow an arbitrary series-to-primitives transformer (from sktime), or a function with specific signature, as the aggregate algorithm
  • the docstring should say whether subsequence_len is in iloc or loc units
  • the docstring should say what exactly the algorithm is - it is implied but never explicitly stated
  • are there any literature references for the use of this algorithm that one should add? It is simple, but perhaps where it is used as a component is helpful.
  • Perhaps different naming might be more convenient to the user? Shorten subsequence_len, and call method (not very specific), selector or select_fun orselect_method or similar?

@wirrywoo
Copy link
Contributor Author
wirrywoo commented Aug 19, 2024

Some updates from the great recommendations provided by @fkiraly (Thanks!):

  • we could allow an arbitrary series-to-primitives transformer (from sktime), or a function with specific signature, as the aggregate algorithm

I was able to add numpy function support for aggregate_fn but I was unable to integrate sktime's series-to-primitives transformer since these transformers return a set of primitives or scalars in the form of a pd.DataFrame rather than one scalar.

  • the docstring should say whether subsequence_len is in iloc or loc units

Addressed here.

  • the docstring should say what exactly the algorithm is - it is implied but never explicitly stated

Addressed here. Is there a way to check if the LaTeX is correctly rendered and if the hyperlink reference to Wikipedia is also correctly formatted?

  • are there any literature references for the use of this algorithm that one should add? It is simple, but perhaps where it is used as a component is helpful.

Added here. It took me a while to find one but it turns out that this transformer generalizes the maximum sum subarray problem to using other aggregate functions and selector methods. Because of this, I changed the default values of the transformer here.

  • Perhaps different naming might be more convenient to the user? Shorten subsequence_len, and call method (not very specific), selector or select_fun orselect_method or similar?

Changed subsequence_len to subseq_len and method to selector throughout this commit.

@fkiraly
Copy link
Collaborator
fkiraly commented Aug 19, 2024

ok, great! Already approved before, but before approving again, we need to make sure that the tests run.

@wirrywoo wirrywoo requested a review from fkiraly August 19, 2024 11:31
1. :math:`A(x_{i}, \cdots, x_{i+k-1})` is maximal when `selector = 'max'`, and
2. :math:`A(x_{i}, \cdots, x_{i+k-1})` is minimal when `selector = 'min'`.

When `aggregate_fn = np.sum` and `selector = 'max'`, the problem degenerates to the
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to ensure the docstring displays code fonts correctly, use double backticks, not single backticks for code. Single backticks are reserved for rst expressions such as :math: .

Also, minor point about this sentence, I would lead with the special case: "Maximum sum subarray is a special case and can be obtained by setting ..." - if this is an important subcase, I would also mention it further up.

"handles-missing-data": False,
}

def __init__(self, subseq_len, aggregate_fn=np.sum, kwargs={}, selector="max"):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we must avoid functions and mutable defaults like the dict.

The pattern to deal with this is using None, and internally a private attribute self._aggregate_fn (since the attributes of same name should always be identical to init args).

Copy link
Collaborator
@fkiraly fkiraly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only some small change requests remaining, see above.

@wirrywoo
Copy link
Contributor Author

Thanks for the review @fkiraly. Hopefully this will be the last batch of code updates. 😄

to ensure the docstring displays code fonts correctly, use double backticks, not single backticks for code. Single backticks are reserved for rst expressions such as :math: .

Also, minor point about this sentence, I would lead with the special case: "Maximum sum subarray is a special case and can be obtained by setting ..." - if this is an important subcase, I would also mention it further up.

Addressed here.

we must avoid functions and mutable defaults like the dict.

The pattern to deal with this is using None, and internally a private attribute self._aggregate_fn (since the attributes of same name should always be identical to init args).

Addressed here. Similar to func_fitter.py [code here], I made aggregate_fn a required argument rather than enforcing a default callable value. I think this is a better design as there's really no reason to actually have a default callable based on the goals of the transformer. I also defaulted kwargs = None.

we should also add one param set that has only defaults, except for subseq_len which needs to be set.

Because I decided to make aggregate_fn a required parameter, I added a third case when all required parameters are defined here.

remove one newline

Removed here.

fkiraly
fkiraly previously approved these changes Aug 19, 2024
Copy link
Collaborator
@fkiraly fkiraly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! left one non-blocking comment about documenting the options.

@wirrywoo
Copy link
Contributor Author
wirrywoo commented Aug 19, 2024

Thanks @fkiraly !

I think most users will not know what this means, I would suggest providing a more general signature (e.g., 2D array -> scalar), or giving a few common examples.

Agreed. Updated here. I also updated the example with the required parameters here.

fkiraly
fkiraly previously approved these changes Aug 19, 2024
@wirrywoo
Copy link
Contributor Author
wirrywoo commented Aug 19, 2024

Addressed merge conflicts in .all-contributorsrc here.

@wirrywoo wirrywoo requested a review from fkiraly August 20, 2024 11:51
@fkiraly fkiraly merged commit 1036cd8 into sktime:main Aug 23, 2024
44 of 45 checks passed
@wirrywoo wirrywoo deleted the origin/subsequence-extraction branch August 23, 2024 16:40
benHeid pushed a commit to Z-Fran/sktime that referenced this pull request Oct 3, 2024
…ns` module (sktime#6967)

#### Reference Issues/PRs

Fixes sktime#6901.

#### What does this implement/fix? Explain your changes.
Subsequence extraction transformer is a transformer created to extract
subsequences of specified length that
meet some criterion with respect to an aggregate function. It addresses
questions like the following:

- Given a sequence $(x_1, x_2, ..., x_n)$, find a $k$ - length
subsequence such that the average value of $(x_j, x_{j+1}, ...,
x_{j+k-1})$ is maximal.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Adding new functionality module:transformations transformations module: time series transformation, feature extraction, pre-/post-processing
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[ENH] Add Subsequence Extraction Transformer to sktime.transformations module
2 participants
0