8000 Remove scikit-learn as a dependency by AJTheDataGuy · Pull Request #774 · nci/scores · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

Remove scikit-learn as a dependency #774

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
Nov 28, 2024

Conversation

AJTheDataGuy
Copy link
Contributor

Please work through the following checklists. Delete anything that isn't relevant.

Development for new xarray-based metrics

  • Works with n-dimensional data and includes reduce_dims, preserve_dims, and weights args.
  • Typehints added
  • Add error handling
  • Imported into the API
  • Works with both xr.DataArrays and xr.Datasets if possible

Docstrings

  • Docstrings complete and follow Napoleon (google) style
  • Maths equation added
  • Reference to paper/webpage is in docstring. The preferred referencing style for journal articles is APA (7th edition)
  • Code example added

Testing of new xarray-based metrics

  • 100% unit test coverage
  • Test that metric is compatible with dask.
  • Test that metrics work with inputs that contain NaNs
  • Test that broadcasting with xarray works
  • Test both reduce and preserve dims arguments work
  • Test that errors are raised as expected
  • Test that it works with both xr.DataArrays and xr.Datasets

Tutorial notebook

  • Short introduction to why you would use that metric and what it tells you
  • A link to a reference
  • A "things to try next" section at the end
  • Add notebook to Tutorial_Gallery.ipynb
  • Optional - a detailed discussion of how the metric works at the end of the notebook

Documentation

Copy link
Collaborator
@Steph-Chong Steph-Chong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@AJTheDataGuy thanks for this PR.

I have made two minor comments for your consideration.

@tennlee tennlee linked an issue Nov 26, 2024 that may be closed by this pull request
@tennlee
Copy link
Collaborator
tennlee commented Nov 26, 2024

To anyone reviewing this (perhaps @nicholasloveday), AJ has left some additional comments on testing in issue #636 which should be reviewed also.

@AJTheDataGuy
Copy link
Contributor Author

Hi @nicholasloveday I just ran some tests.

Using a modified version of the testing script I ran 4 tests. All 4 tests used x,y, and weight arrays of 10 million elements each and I used cProfile to time the results for the overall script, running cProfile from the command line. The 4 tests I ran were:

  1. Scikit (1 run - simulate user creating a single model )
  2. Scikit (50 runs - simulate user creating 50 models)
  3. Scipy (1 run - simulate user creating a single model)
  4. Scipy (50 runs - simulate user creating 50 models)

1 run results:

  • Scikit: 2.9s
  • Scipy: 2.3s
  • Winner: Scipy (20% improvement)

50 run results:

  • Scikit: 105s
  • Scipy: 50s
  • Winner: Scipy (over 50% improvement)

These initial tests suggest Scipy is a nice performance enhance over Scikit-learn for this use case.

I've attached the python test files (start with "test") and well as the results files (start with "result") if you'd like to take a look.

test_scipy_50.txt
test_scikit_50.txt
test_scikit_1.txt
test_scipy_1.txt

result_scikit_1.txt
result_scikit_50.txt
result_scipy_1.txt
result_scipy_50.txt

@Steph-Chong I will make those changes soon.

@nicholasloveday
Copy link
Collaborator

Thanks for checking that @AJTheDataGuy - that's great that there are some performance improvements!

AJ Fisher and others added 6 commits November 28, 2024 12:59
@AJTheDataGuy AJTheDataGuy force-pushed the isotonic-scipy-from-scikit branch from 643b9e7 to a5d0ecc Compare November 28, 2024 03:05
@AJTheDataGuy
Copy link
Contributor Author
AJTheDataGuy commented Nov 28, 2024

Hi All - just made some updates and pushed a commit to the feature branch to my fork.

Updates:

  1. Implemented the docstring comments for the _contiguous_mean_ir function as described above.
  2. Fixed a merge conflict on the zenodo file

3A. I noticed that the x array parameter was no longer needed on the _contiguous_mean_ir function so removed that.
3B. Removed all fcst (x array) parameters for calls to the _contiguous_mean_ir function
3C. In a similiar sense, the _do_ir function no longer required the fcst (x array) parameter. Removed that and changed the doc string.
3D. Removed the fcst parameter in all calls to the _do_ir function
(All changes to internal functions so hopefully the changes won't impact the end user.)

  1. Updated the test for test__do_ir in the test_isoreg.py file to reflect that the fcst (x array) parameter is no longer needed.

Pytest passing all tests at the end.

Can someone please rerun the workflows?

@tennlee tennlee merged commit 667f167 into nci:develop Nov 28, 2024
11 checks passed
@tennlee
Copy link
Collaborator
tennlee commented Nov 28, 2024

Thanks very much for this work. It's really helpful and it's a great feeling to reduce the number of dependencies.

nicholasloveday pushed a commit to nicholasloveday/scores that referenced this pull request Dec 2, 2024
* Refactored isotonic regression function from scikit-learn to scipy

Please note, this was manually speed tested, and proved to be faster according to that test regime. So this PR reduces complexity and may speed things up at the same time.

---------

Signed-off-by: Tennessee Leeuwenburg <134973832+tennlee@users.noreply.github.com>
Co-authored-by: AJ Fisher <AJTheDataGuy@gmail.com>
Co-authored-by: Tennessee Leeuwenburg <134973832+tennlee@users.noreply.github.com>
Co-authored-by: Stephanie Chong <168800785+Steph-Chong@users.noreply.github.com>
@tennlee tennlee changed the title Isotonic scipy from scikit Remove scikit-learn as a dependency Dec 2, 2024
@tennlee
Copy link
Collaborator
tennlee commented Dec 2, 2024

Fixes #636

tennlee added a commit that referenced this pull request Dec 7, 2024
* Refactored isotonic regression function from scikit-learn to scipy

Please note, this was manually speed tested, and proved to be faster according to that test regime. So this PR reduces complexity and may speed things up at the same time.

---------

Signed-off-by: Tennessee Leeuwenburg <134973832+tennlee@users.noreply.github.com>
Co-authored-by: AJ Fisher <AJTheDataGuy@gmail.com>
Co-authored-by: Tennessee Leeuwenburg <134973832+tennlee@users.noreply.github.com>
Co-authored-by: Stephanie Chong <168800785+Steph-Chong@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

scipy isotonic regression
4 participants
0