[ENH] Add Subsequence Extraction Transformer to `sktime.transformations` module #6967

wirrywoo · 2024-08-13T22:40:07Z

Reference Issues/PRs

What does this implement/fix? Explain your changes.

Subsequence extraction transformer is a transformer created to extract subsequences of specified length that
meet some criterion with respect to an aggregate function. It addresses questions like the following:

Given a sequence $(x_1, x_2, ..., x_n)$, find a $k$ - length subsequence such that the average value of $(x_j, x_{j+1}, ..., x_{j+k-1})$ is maximal.

Does your contribution introduce a new dependency? If yes, which one?

Nope.

What should a reviewer concentrate their feedback on?

Agreed naming choices for parameters and class variables involved
Potential areas of improvements regarding optimizing code performance and code refactoring
Ensure transformed results are accurate
Descriptive documentation

Did you add any tests for the change?

Nope.

Any other comments?

Excited to add my first PR to sktime. Really looking forward to the reviews!

PR checklist

For all contributions

I've added myself to the list of contributors with any new badges I've earned :-)
How to: add yourself to the all-contributors file in the sktime root directory (not the CONTRIBUTORS.md). Common badges: code - fixing a bug, or adding code logic. doc - writing or improving documentation or docstrings. bug - reporting or diagnosing a bug (get this plus code if you also fixed the bug in the PR).maintenance - CI, test framework, release.
See here for full badge reference
Optionally, for added estimators: I've added myself and possibly to the maintainers tag - do this if you want to become the owner or maintainer of an estimator you added.
See here for further details on the algorithm maintainer role.
The PR title starts with either [ENH], [MNT], [DOC], or [BUG]. [BUG] - bugfix, [MNT] - CI, test framework, [ENH] - adding or improving code, [DOC] - writing or improving documentation or docstrings.

For new estimators

I've added the estimator to the API reference - in docs/source/api_reference/taskname.rst, follow the pattern.
I've added one or more illustrative usage examples to the docstring, in a pydocstyle compliant Examples section.
If the estimator relies on a soft dependency, I've set the python_dependencies tag and ensured
dependency isolation, see the estimator dependencies guide.

fkiraly

Some questions about the design here.

why is this a panel transformer? Do we need information about the other sequences for one given sequence?
related: why are we padding at first? Should the algorithm not also work for unequal length series?
currently, we are carrying out the aggregation in fit and remembering indices for transform, correct? Should we not instead carry it out in transform, i nthe first place?
we can carry out minimal parameter validation in fit, or in __init__, to avoid the "else" and "try-except" blocks later.

Further, code formatting checks are faliing, I assume that precommit has not been fully set up by you - here's how to set them up locally: https://www.sktime.net/en/latest/developer_guide/coding_standards.html#setting-up-local-code-quality-checks

wirrywoo · 2024-08-14T13:53:14Z

Responses to @fkiraly:

why is this a panel transformer? Do we need information about the other sequences for one given sequence?

Ahh, I was referencing TruncationTransformer thinking that the functionality would be similar as I'm trying to identify the correct indices to extract the subsequences, but now I realized why TruncationTransformer would be considered a panel transformer while SubsequenceExtractionTransformer wouldn't. Perhaps it should be classified as a Series-to-series transformer under Window-based series transforms instead? I made this change in this commit.

related: why are we padding at first? Should the algorithm not also work for unequal length series?

Remo 8000 ved in this commit. Initially I had it to pass some checks, but found an alternative solution without needing to use PaddingTransformer.

currently, we are carrying out the aggregation in fit and remembering indices for transform, correct? Should we not instead carry it out in transform, i nthe first place?

Initially, I defined "fitting" as trying to compute the indices needed to retrieve the correct subsequence from the sequence. Then, I reserved the "transform" step to actually retrieve the subsequence from the input data. Do you think this is a good way to design this transformer or can the fitting definition be improved here?

we can carry out minimal parameter validation in fit, or in init, to avoid the "else" and "try-except" blocks later.

Refactored in this commit.

Further, code formatting checks are faliing, I assume that precommit has not been fully set up by you - here's how to set them up locally: https://www.sktime.net/en/latest/developer_guide/coding_standards.html#setting-up-local-code-quality-checks

Ahh, I forgot that I reconfigured my WSL instance to use Ubuntu 22.04 and forgot to reinstall the pre-commits in that process. Thanks!

fkiraly · 2024-08-14T15:51:55Z

Do you think this is a good way to design this transformer or can the fitting definition be improved here?

In general, the series seen in transform can be different from the one seen in fit. So, should we not carry out the computation in transform? Not entirely sure though.

Perhaps it should be classified as a Series-to-series transformer under Window-based series transforms instead?

Yes, series-to-series is the correct type in my opinion.

wirrywoo · 2024-08-14T17:04:51Z

In general, the series seen in transform can be different from the one seen in fit. So, should we not carry out the computation in transform? Not entirely sure though.

I see. It does make sense especially if we are using a fitted transformer for inference purposes, so after more thought, I agree with you. Changes are in this commit.

Thanks!

fkiraly · 2024-08-15T08:23:51Z

great, restarting the tests.

wirrywoo · 2024-08-15T09:12:35Z

Apologies for the failed pre-commit. I believe I resolved it this time. I need to double check if I set my pre-commits correctly on my local machine.

Can you restart the tests when you get a chance? Thanks!

fkiraly · 2024-08-15T09:12:52Z

(another go)

wirrywoo · 2024-08-15T11:02:54Z

Running into build issue #6772, namely Read the Docs build failed due to timeout. Confirmed that this issue shouldn't block the PR's build. A restart on the build until the docs build step passes should do the trick.

fkiraly

Looks good to me!

Suggestions for improvement, but not blocking:

we could allow an arbitrary series-to-primitives transformer (from sktime), or a function with specific signature, as the aggregate algorithm
the docstring should say whether subsequence_len is in iloc or loc units
the docstring should say what exactly the algorithm is - it is implied but never explicitly stated
are there any literature references for the use of this algorithm that one should add? It is simple, but perhaps where it is used as a component is helpful.
Perhaps different naming might be more convenient to the user? Shorten subsequence_len, and call method (not very specific), selector or select_fun orselect_method or similar?

wirrywoo · 2024-08-19T10:26:24Z

Some updates from the great recommendations provided by @fkiraly (Thanks!):

we could allow an arbitrary series-to-primitives transformer (from sktime), or a function with specific signature, as the aggregate algorithm

I was able to add numpy function support for aggregate_fn but I was unable to integrate sktime's series-to-primitives transformer since these transformers return a set of primitives or scalars in the form of a pd.DataFrame rather than one scalar.

the docstring should say whether subsequence_len is in iloc or loc units

Addressed here.

the docstring should say what exactly the algorithm is - it is implied but never explicitly stated

Addressed here. Is there a way to check if the LaTeX is correctly rendered and if the hyperlink reference to Wikipedia is also correctly formatted?

are there any literature references for the use of this algorithm that one should add? It is simple, but perhaps where it is used as a component is helpful.

Added here. It took me a while to find one but it turns out that this transformer generalizes the maximum sum subarray problem to using other aggregate functions and selector methods. Because of this, I changed the default values of the transformer here.

Perhaps different naming might be more convenient to the user? Shorten subsequence_len, and call method (not very specific), selector or select_fun orselect_method or similar?

Changed subsequence_len to subseq_len and method to selector throughout this commit.

fkiraly · 2024-08-19T10:42:02Z

ok, great! Already approved before, but before approving again, we need to make sure that the tests run.

fkiraly · 2024-08-19T11:37:24Z

sktime/transformations/series/subsequence_extraction.py

+    1. :math:`A(x_{i}, \cdots, x_{i+k-1})` is maximal when `selector = 'max'`, and
+    2. :math:`A(x_{i}, \cdots, x_{i+k-1})` is minimal when `selector = 'min'`.
+
+    When `aggregate_fn = np.sum` and `selector = 'max'`, the problem degenerates to the


to ensure the docstring displays code fonts correctly, use double backticks, not single backticks for code. Single backticks are reserved for rst expressions such as :math: .

Also, minor point about this sentence, I would lead with the special case: "Maximum sum subarray is a special case and can be obtained by setting ..." - if this is an important subcase, I would also mention it further up.

fkiraly · 2024-08-19T11:38:41Z

sktime/transformations/series/subsequence_extraction.py

+        "handles-missing-data": False,
+    }
+
+    def __init__(self, subseq_len, aggregate_fn=np.sum, kwargs={}, selector="max"):


we must avoid functions and mutable defaults like the dict.

The pattern to deal with this is using None, and internally a private attribute self._aggregate_fn (since the attributes of same name should always be identical to init args).

sktime/transformations/series/subsequence_extraction.py

fkiraly

Only some small change requests remaining, see above.

wirrywoo · 2024-08-19T12:31:01Z

Thanks for the review @fkiraly. Hopefully this will be the last batch of code updates. 😄

to ensure the docstring displays code fonts correctly, use double backticks, not single backticks for code. Single backticks are reserved for rst expressions such as :math: .

Also, minor point about this sentence, I would lead with the special case: "Maximum sum subarray is a special case and can be obtained by setting ..." - if this is an important subcase, I would also mention it further up.

Addressed here.

we must avoid functions and mutable defaults like the dict.

The pattern to deal with this is using None, and internally a private attribute self._aggregate_fn (since the attributes of same name should always be identical to init args).

Addressed here. Similar to func_fitter.py [code here], I made aggregate_fn a required argument rather than enforcing a default callable value. I think this is a better design as there's really no reason to actually have a default callable based on the goals of the transformer. I also defaulted kwargs = None.

we should also add one param set that has only defaults, except for subseq_len which needs to be set.

Because I decided to make aggregate_fn a required parameter, I added a third case when all required parameters are defined here.

remove one newline

Removed here.

sktime/transformations/series/subsequence_extraction.py

fkiraly

Looks good! left one non-blocking comment about documenting the options.

wirrywoo · 2024-08-19T15:39:08Z

Thanks @fkiraly !

I think most users will not know what this means, I would suggest providing a more general signature (e.g., 2D array -> scalar), or giving a few common examples.

Agreed. Updated here. I also updated the example with the required parameters here.

wirrywoo · 2024-08-19T22:36:16Z

Addressed merge conflicts in .all-contributorsrc here.

…ns` module (sktime#6967) #### Reference Issues/PRs Fixes sktime#6901. #### What does this implement/fix? Explain your changes. Subsequence extraction transformer is a transformer created to extract subsequences of specified length that meet some criterion with respect to an aggregate function. It addresses questions like the following: - Given a sequence $(x_1, x_2, ..., x_n)$, find a $k$ - length subsequence such that the average value of $(x_j, x_{j+1}, ..., x_{j+k-1})$ is maximal.

wirrywoo added 2 commits August 13, 2024 23:28

Create subsequence_extraction.py

96c029d

updated docs

ade59a1

wirrywoo marked this pull request as ready for review August 13, 2024 22:52

wirrywoo requested review from achieveordie, benHeid, fkiraly and yarnabrina as code owners August 13, 2024 22:52

fkiraly added module:transformations transformations module: time series transformation, feature extraction, pre-/post-processing enhancement Adding new functionality labels Aug 14, 2024

fkiraly requested changes Aug 14, 2024

View reviewed changes

wirrywoo added 4 commits August 14, 2024 11:49

removed PaddingTransformer and addressed lint

02d4315

resolved lint issues

4a947a0

refactored error handling

33e20d4

moved to series folder from panel

9391fdc

wirrywoo requested a review from fkiraly August 14, 2024 14:36

redefine fit function and improved test cases

71f768c

resolve failed ruff-format test

3e843f3

fkiraly previously approved these changes Aug 15, 2024

View reviewed changes

addressed all fkiraly's suggestions for improvement

eef8b96

wirrywoo dismissed fkiraly’s stale review via eef8b96 August 19, 2024 10:14

wirrywoo requested a review from fkiraly August 19, 2024 11:31

fkiraly reviewed Aug 19, 2024

View reviewed changes

sktime/transformations/series/subsequence_extraction.py Show resolved Hide resolved

fkiraly reviewed Aug 19, 2024

View reviewed changes

sktime/transformations/series/subsequence_extraction.py Show resolved Hide resolved

fkiraly requested changes Aug 19, 2024

View reviewed changes

addressed fkiraly's code review

6d2c976

Merge branch 'sktime:main' into origin/subsequence-extraction

272e633

fkiraly reviewed Aug 19, 2024

View reviewed changes

sktime/transformations/series/subsequence_extraction.py Outdated Show resolved Hide resolved

fkiraly previously approved these changes Aug 19, 2024

View reviewed changes

updated aggregate_fn description and updated example

83e37f7

wirrywoo dismissed fkiraly’s stale review via 83e37f7 August 19, 2024 15:36

fkiraly previously approved these changes Aug 19, 2024

View reviewed changes

Merge branch 'main' into origin/subsequence-extraction

d0d4912

wirrywoo dismissed fkiraly’s stale review via d0d4912 August 19, 2024 22:33

resolved formatting in .all-contributorsrc

70af559

wirrywoo requested a review from fkiraly August 20, 2024 11:51

fkiraly merged commit 1036cd8 into sktime:main Aug 23, 2024
44 of 45 checks passed

wirrywoo deleted the origin/subsequence-extraction branch August 23, 2024 16:40

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

[ENH] Add Subsequence Extraction Transformer to `sktime.transformations` module #6967

[ENH] Add Subsequence Extraction Transformer to `sktime.transformations` module #6967

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

[ENH] Add Subsequence Extraction Transformer to sktime.transformations module #6967

[ENH] Add Subsequence Extraction Transformer to sktime.transformations module #6967

Uh oh!

Conversation

Uh oh!

Reference Issues/PRs

What does this implement/fix? Explain your changes.

Does your contribution introduce a new dependency? If yes, which one?

What should a reviewer concentrate their feedback on?

Did you add any tests for the change?

Any other comments?

PR checklist

For all contributions

For new estimators

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

[ENH] Add Subsequence Extraction Transformer to `sktime.transformations` module #6967

[ENH] Add Subsequence Extraction Transformer to `sktime.transformations` module #6967