Unify softmax implementations by elisno · Pull Request #826 · cleanlab/cleanlab · GitHub

Unify softmax implementations #826


Merged
merged 14 commits into cleanlab:master on Aug 30, 2023

Conversation

@elisno (Member) commented Aug 23, 2023

Summary

This PR refactors the usage of the softmax function across various modules within the cleanlab package. A dedicated softmax function is introduced to enhance code reusability and improve numerical stability, ensuring the codebase remains robust and maintainable.

Changes

  • Introduced a softmax utility function: Located in cleanlab/internal/numerics.py, this function supports a softmax temperature, selection of axis, and an optional shift for numeric stability (see the sketch after this list).
  • Refactored existing code: Replaced all explicit softmax implementations with calls to the new utility function across multiple modules (multiannotator_utils, multilabel_scorer, object_detection_utils, and token_classification/rank).
  • (secondary) Improved find_best_temp_scaler logic: Refactored the temperature scaling in the multiannotator_utils module to use the new softmax function. This also led to the introduction of the helper function _set_fine_search_range to better organize and separate the search-range logic.
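
A minimal sketch of what the new utility plausibly looks like, reconstructed from the usage example and the warning tracebacks shown below (the real implementation lives in cleanlab/internal/numerics.py; the default argument values here are assumptions):

import numpy as np

def softmax(x: np.ndarray, temperature: float = 1.0, axis: int = 0, shift: bool = True) -> np.ndarray:
    """Compute softmax with optional temperature scaling and max-shift for numeric stability."""
    x = x / temperature  # higher temperature flattens the distribution
    if shift:
        # Subtracting the max keeps np.exp from overflowing; the result is
        # mathematically unchanged because the common factor cancels.
        x = x - np.max(x, axis=axis, keepdims=True)
    exp_x = np.exp(x)
    return exp_x / np.sum(exp_x, axis=axis, keepdims=True)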

The PR assumes that the current test suite covers these modules and would flag any behavioral deviations.
It does not address related implementations such as softmin (see the note below).
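
If softmin (mentioned again in the review below) were unified later, it could reuse the same utility, since softmin is simply softmax applied to negated inputs. A hypothetical helper, not part of this PR:

from cleanlab.internal.numerics import softmax
import numpy as np

def softmin(x: np.ndarray, **kwargs) -> np.ndarray:
    # Hypothetical: softmin(x) == softmax(-x), so smaller values get larger weight.
    return softmax(-x, **kwargs)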

Usage

The softmax function can be used for both 1D and 2D arrays.

Here's a quick demonstration:

from cleanlab.internal.numerics import softmax
import numpy as np

# Sample data
array_1d = np.array([2.0, 1.0, 0.1])
array_2d = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])

array_2d_large_values = array_2d + 1e10

# 1D Array with temperature and shift
softmax_1d_temp = softmax(array_1d, temperature=2.0)
softmax_1d_shift = softmax(array_1d, shift=False)
print("Softmax for 1D array with temperature:", softmax_1d_temp)
print("Softmax for 1D array without shift:", softmax_1d_shift, "\n")

# 2D Array with temperature and shift
softmax_2d_temp = softmax(array_2d, temperature=2.0, axis=1)
softmax_2d_shift = softmax(array_2d, shift=False, axis=1)
print("Softmax for 2D array with temperature:", softmax_2d_temp)
print("Softmax for 2D array without shift:", softmax_2d_shift, "\n")

# 2D Array with large values
softmax_2d_large_values_temp = softmax(array_2d_large_values, temperature=2.0, axis=1)
softmax_2d_large_values_shift = softmax(array_2d_large_values, shift=False, axis=1)
print("Softmax for 2D array with large values and temperature:", softmax_2d_large_values_temp)
print("Softmax for 2D array with large values without shift:", softmax_2d_large_values_shift)

Outputs:

Softmax for 1D array with temperature: [0.50168776 0.30428901 0.19402324]
Softmax for 1D array without shift: [0.65900114 0.24243297 0.09856589] 

Softmax for 2D array with temperature: [[0.18632372 0.30719589 0.50648039]
 [0.18632372 0.30719589 0.50648039]]
Softmax for 2D array without shift: [[0.09003057 0.24472847 0.66524096]
 [0.09003057 0.24472847 0.66524096]] 

/workspaces/cleanlab/cleanlab/internal/numerics.py:36: RuntimeWarning: overflow encountered in exp
  exp_x = np.exp(x)
/workspaces/cleanlab/cleanlab/internal/numerics.py:37: RuntimeWarning: invalid value encountered in divide
  return exp_x / np.sum(exp_x, axis=axis, keepdims=True)
Softmax for 2D array with large values and temperature: [[0.18632372 0.30719589 0.50648039]
 [0.18632372 0.30719589 0.50648039]]
Softmax for 2D array with large values without shift: [[nan nan nan]
 [nan nan nan]]
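
The nan rows show why the shift matters: softmax is invariant to subtracting a constant from its input (the common factor cancels in the ratio), so shifting by the max changes nothing mathematically but keeps np.exp finite. A standalone illustration in plain numpy:

import numpy as np

x = np.array([1e10 + 1.0, 1e10 + 2.0, 1e10 + 3.0])

# Naive softmax: np.exp(1e10) overflows to inf, and inf / inf yields nan.
naive = np.exp(x) / np.sum(np.exp(x))

# Shifted softmax: identical in exact arithmetic, but stays finite here.
shifted = np.exp(x - x.max()) / np.sum(np.exp(x - x.max()))

print(naive)    # [nan nan nan] (after RuntimeWarnings for overflow/divide)
print(shifted)  # [0.09003057 0.24472847 0.66524096]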

@elisno requested a review from huiwengoh August 23, 2023 18:18
codecov bot commented Aug 23, 2023

Codecov Report

Merging #826 (c6aa3c1) into master (fd2506c) will increase coverage by 0.09%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master     #826      +/-   ##
==========================================
+ Coverage   96.63%   96.73%   +0.09%     
==========================================
  Files          64       65       +1     
  Lines        5081     5077       -4     
  Branches      879      880       +1     
==========================================
+ Hits         4910     4911       +1     
+ Misses         88       85       -3     
+ Partials       83       81       -2     
Files Changed                                  Coverage            Δ
cleanlab/internal/multiannotator_utils.py      98.26% <100.00%>    (ø)
cleanlab/internal/multilabel_scorer.py        100.00% <100.00%>    (ø)
cleanlab/internal/numerics.py                 100.00% <100.00%>    (ø)
cleanlab/internal/object_detection_utils.py   100.00% <100.00%>    (ø)
cleanlab/token_classification/rank.py         100.00% <100.00%>    (ø)

... and 1 file with indirect coverage changes

@jwmueller (Member)

There's also softmin being used for segmentation, right?

@jwmueller (Member) commented Aug 23, 2023

Could you run each of the main user-facing methods once (i.e., the find_label_issues analog from each tutorial notebook) and verify the outputs have not changed before vs. after this PR?

I'm not 100% confident that any introduced mathematical error would be caught without this manual check.
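
A concrete form this check could take (hypothetical sketch; labels, pred_probs, and the saved baseline issues_before_pr are placeholders, and the analogous method differs per task):

import numpy as np
from cleanlab.filter import find_label_issues

# Re-run on each tutorial dataset and compare against outputs saved from master.
issues = find_label_issues(labels=labels, pred_probs=pred_probs)
np.testing.assert_array_equal(issues, issues_before_pr)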

huiwengoh and others added 7 commits August 23, 2023 15:31
- Update type hint for `min_entropy_ind` from built-in `int` to `np.intp`.
  - This refinement addresses a type compatibility warning.
  - `np.intp` is the integer type used by numpy for indexing and can differ in size from the built-in Python `int` depending on the platform (32-bit vs 64-bit).
  - Mypy highlighted this type hint discrepancy.
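
For illustration (not code from the PR): np.argmin returns numpy's pointer-sized indexing integer rather than a built-in int, which is what the updated annotation reflects:

import numpy as np

entropies = np.array([0.7, 0.2, 0.5])
# np.argmin returns np.intp, not the built-in int, so this annotation
# matches the actual return type and satisfies mypy.
min_entropy_ind: np.intp = np.argmin(entropies)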
@jwmueller requested a review from huiwengoh August 29, 2023 21:33
@elisno (Member, Author) commented Aug 30, 2023

> Could you run each of the main user-facing methods once (i.e., the find_label_issues analog from each tutorial notebook) and verify the outputs have not changed before vs. after this PR?
>
> I'm not 100% confident that any introduced mathematical error would be caught without this manual check.

Verified that the outputs of the corresponding methods are not affected.

@elisno merged commit 4ce9f77 into cleanlab:master Aug 30, 2023