Unify softmax implementations by elisno · Pull Request #826 · cleanlab/cleanlab · GitHub

Unify softmax implementations #826


Merged
merged 14 commits into cleanlab:master on Aug 30, 2023

Conversation

@elisno (Member) commented Aug 23, 2023

Summary

This PR refactors the usage of the softmax function across various modules within the cleanlab package. A dedicated softmax function is introduced to enhance code reusability and improve numerical stability, ensuring the codebase remains robust and maintainable.

Changes

  • Introduced a softmax utility function: Located in cleanlab/internal/numerics.py, this function supports a softmax temperature, selection of axis, and an optional shift for numeric stability (see the sketch after this list).
  • Refactored existing code: Replaced all explicit softmax implementations with calls to the new utility function across multiple modules (multiannotator_utils, multilabel_scorer, object_detection_utils, and token_classification/rank).
  • (secondary) Improved find_best_temp_scaler logic: Refactored the temperature scaling in the multiannotator_utils module to use the new softmax function. This also led to the introduction of the helper function _set_fine_search_range to better organize and separate the search-range logic.
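
A minimal sketch of what the new utility plausibly looks like, reconstructed from the usage example and the warning tracebacks shown below (the real implementation lives in cleanlab/internal/numerics.py; the default argument values here are assumptions):

import numpy as np

def softmax(x: np.ndarray, temperature: float = 1.0, axis: int = 0, shift: bool = True) -> np.ndarray:
    """Compute softmax with optional temperature scaling and max-shift for numeric stability."""
    x = x / temperature  # higher temperature flattens the distribution
    if shift:
        # Subtracting the max keeps np.exp from overflowing; the result is
        # mathematically unchanged because the common factor cancels.
        x = x - np.max(x, axis=axis, keepdims=True)
    exp_x = np.exp(x)
    return exp_x / np.sum(exp_x, axis=axis, keepdims=True)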

The PR assumes that the current test suite covers these modules and would flag any behavioral deviations.
It does not address related implementations such as softmin (see the note below).
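
If softmin (mentioned again in the review below) were unified later, it could reuse the same utility, since softmin is simply softmax applied to negated inputs. A hypothetical helper, not part of this PR:

from cleanlab.internal.numerics import softmax
import numpy as np

def softmin(x: np.ndarray, **kwargs) -> np.ndarray:
    # Hypothetical: softmin(x) == softmax(-x), so smaller values get larger weight.
    return softmax(-x, **kwargs)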

Usage

The softmax function can be used for both 1D and 2D arrays.

Here's a quick demonstration:

from cleanlab.internal.numerics import softmax
import numpy as np

# Sample data
array_1d = np.array([2.0, 1.0, 0.1])
array_2d = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])

array_2d_large_values = array_2d + 1e10

# 1D Array with temperature and shift
softmax_1d_temp = softmax(array_1d, temperature=2.0)
softmax_1d_shift = softmax(array_1d, shift=False)
print("Softmax for 1D array with temperature:", softmax_1d_temp)
print("Softmax for 1D array without shift:", softmax_1d_shift, "\n")

# 2D Array with temperature and shift
softmax_2d_temp = softmax(array_2d, temperature=2.0, axis=1)
softmax_2d_shift = softmax(array_2d, shift=False, axis=1)
print("Softmax for 2D array with temperature:", softmax_2d_temp)
print("Softmax for 2D array without shift:", softmax_2d_shift, "\n")

# 2D Array with large values
softmax_2d_large_values_temp = softmax(array_2d_large_values, temperature=2.0, axis=1)
softmax_2d_large_values_shift = softmax(array_2d_large_values, shift=False, axis=1)
print("Softmax for 2D array with large values and temperature:", softmax_2d_large_values_temp)
print("Softmax for 2D array with large values without shift:", softmax_2d_large_values_shift)

Outputs:

Softmax for 1D array with temperature: [0.50168776 0.30428901 0.19402324]
Softmax for 1D array without shift: [0.65900114 0.24243297 0.09856589] 

Softmax for 2D array with temperature: [[0.18632372 0.30719589 0.50648039]
 [0.18632372 0.30719589 0.50648039]]
Softmax for 2D array without shift: [[0.09003057 0.24472847 0.66524096]
 [0.09003057 0.24472847 0.66524096]] 

/workspaces/cleanlab/cleanlab/internal/numerics.py:36: RuntimeWarning: overflow encountered in exp
  exp_x = np.exp(x)
/workspaces/cleanlab/cleanlab/internal/numerics.py:37: RuntimeWarning: invalid value encountered in divide
  return exp_x / np.sum(exp_x, axis=axis, keepdims=True)
Softmax for 2D array with large values and temperature: [[0.18632372 0.30719589 0.50648039]
 [0.18632372 0.30719589 0.50648039]]
Softmax for 2D array with large values without shift: [[nan nan nan]
 [nan nan nan]]
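
The nan rows show why the shift matters: softmax is invariant to subtracting a constant from its input (the common factor cancels in the ratio), so shifting by the max changes nothing mathematically but keeps np.exp finite. A standalone illustration in plain numpy:

import numpy as np

x = np.array([1e10 + 1.0, 1e10 + 2.0, 1e10 + 3.0])

# Naive softmax: np.exp(1e10) overflows to inf, and inf / inf yields nan.
naive = np.exp(x) / np.sum(np.exp(x))

# Shifted softmax: identical in exact arithmetic, but stays finite here.
shifted = np.exp(x - x.max()) / np.sum(np.exp(x - x.max()))

print(naive)    # [nan nan nan] (after RuntimeWarnings for overflow/divide)
print(shifted)  # [0.09003057 0.24472847 0.66524096]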

@elisno requested a review from huiwengoh August 23, 2023 18:18
codecov bot commented Aug 23, 2023

Codecov Report

Merging #826 (c6aa3c1) into master (fd2506c) will increase coverage by 0.09%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master     #826      +/-   ##
==========================================
+ Coverage   96.63%   96.73%   +0.09%     
==========================================
  Files          64       65       +1     
  Lines        5081     5077       -4     
  Branches      879      880       +1     
==========================================
+ Hits         4910     4911       +1     
+ Misses         88       85       -3     
+ Partials       83       81       -2     
Files Changed                                  Coverage            Δ
cleanlab/internal/multiannotator_utils.py      98.26% <100.00%>    (ø)
cleanlab/internal/multilabel_scorer.py        100.00% <100.00%>    (ø)
cleanlab/internal/numerics.py                 100.00% <100.00%>    (ø)
cleanlab/internal/object_detection_utils.py   100.00% <100.00%>    (ø)
cleanlab/token_classification/rank.py         100.00% <100.00%>    (ø)

... and 1 file with indirect coverage changes

@jwmueller (Member)

There's also softmin being used for segmentation, right?

@jwmueller (Member) commented Aug 23, 2023

Could you run each of the main user-facing methods once (i.e., the find_label_issues analog from each tutorial notebook) and verify the outputs have not changed before vs. after this PR?

I'm not 100% confident that any introduced mathematical error would be caught without this manual check.
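
A concrete form this check could take (hypothetical sketch; labels, pred_probs, and the saved baseline issues_before_pr are placeholders, and the analogous method differs per task):

import numpy as np
from cleanlab.filter import find_label_issues

# Re-run on each tutorial dataset and compare against outputs saved from master.
issues = find_label_issues(labels=labels, pred_probs=pred_probs)
np.testing.assert_array_equal(issues, issues_before_pr)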

huiwengoh and others added 7 commits August 23, 2023 15:31
- Update type hint for `min_entropy_ind` from built-in `int` to `np.intp`.
  - This refinement addresses a type compatibility warning.
  - `np.intp` is the integer type used by numpy for indexing and can differ in size from the built-in Python `int` depending on the platform (32-bit vs 64-bit).
  - Mypy highlighted this type hint discrepancy.
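
For illustration (not code from the PR): np.argmin returns numpy's pointer-sized indexing integer rather than a built-in int, which is what the updated annotation reflects:

import numpy as np

entropies = np.array([0.7, 0.2, 0.5])
# np.argmin returns np.intp, not the built-in int, so this annotation
# matches the actual return type and satisfies mypy.
min_entropy_ind: np.intp = np.argmin(entropies)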
@jwmueller requested a review from huiwengoh August 29, 2023 21:33
@elisno (Member, Author) commented Aug 30, 2023

> Could you run each of the main user-facing methods once (i.e., the find_label_issues analog from each tutorial notebook) and verify the outputs have not changed before vs. after this PR?
>
> I'm not 100% confident that any introduced mathematical error would be caught without this manual check.

Verified that the outputs of the corresponding methods are not affected.

@elisno merged commit 4ce9f77 into cleanlab:master Aug 30, 2023