Releases: cleanlab/cleanlab

v2.7.1 -- New issue manager and improved docs

27 Feb 15:33
db1d330

This release is non-breaking when upgrading from v2.7.0 and focuses mostly on documentation and testing improvements. The most notable update is:

  • New identifier column issue manager – detects sequential numerical columns that might influence your model. This feature is available as a preview and requires additional setup to use with Datalab.
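As a rough illustration of what a "sequential numerical column" means here, the sketch below flags a column whose values are distinct integers forming an unbroken run, such as a row ID. The function name `is_sequential_identifier` is hypothetical, and this is not the issue manager's actual implementation, only a minimal stand-in for the idea.

```python
def is_sequential_identifier(values):
    """Return True if `values` are distinct integers forming an unbroken run.

    Hypothetical helper for illustration only; the order of the values
    does not matter, so a shuffled ID column is still detected.
    """
    if len(values) < 2:
        return False
    # Reject columns that contain non-integer numbers such as 0.5.
    if not all(float(v).is_integer() for v in values):
        return False
    ints = sorted(int(v) for v in values)
    # Every neighbouring pair must differ by exactly 1 (no gaps, no repeats).
    return all(b - a == 1 for a, b in zip(ints, ints[1:]))


row_ids = [101, 102, 103, 104]   # looks like an identifier column
ages = [23, 35, 35, 41]          # ordinary feature with repeats and gaps
print(is_sequential_identifier(row_ids), is_sequential_identifier(ages))
```

A column like `row_ids` carries no real signal about the label, yet a model can still latch onto it, which is why flagging it before training is useful.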

Other Updates:

  • 📖 Docs & Readme: Improved clarity.
  • 🛠 Test suite: More stability and consistency.

Full Changelog: v2.7.0...v2.7.1

v2.7.0 -- Broadening Data Quality Checks and ML Workflows

26 Sep 16:45
4a1a1fc

This release introduces new features and improvements aimed at helping users detect complex dataset issues and improve their ML models' robustness. As always, we maintain backward compatibility, making this release non-breaking when upgrading from v2.6.6. We continue to support Python 3.8-3.11 in this version, but support for Python 3.8 will be dropped in a future minor release.

Introducing Spurious Correlation Detection in Datalab

With this release, Datalab now detects spurious correlations in image datasets by default, helping users identify potentially misleading patterns that may lead to overfitting or reduced model generalization.

Spurious correlations occur when models pick up on patterns in the data that are coincidental rather than meaningful. For example, a model might incorrectly associate the background color with a particular label, leading to poor generalization on new data. Identifying these correlations helps ensure more reliable models by minimizing the risk of learning from irrelevant or misleading features.

Detecting spurious correlations in image datasets is straightforward:

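from cleanlab import Datalab

# image_dataset is assumed to be loaded already (e.g. a Hugging Face Dataset with image and label columns)
lab = Datalab(data=image_dataset, label_name="label_column", image_key="image_column")
lab.find_issues()  # runs the default checks, including spurious correlations
lab.report()  # prints a summary of the detected issues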

You can find a more detailed workflow for finding spurious correlations in our documentation.

This new issue type aims to give users deeper insights into their data, enabling more robust model development.
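To make the idea of a spurious correlation concrete, here is a toy sketch that asks how well a single image property (mean brightness) predicts the label on its own. The helper names, the tiny 2x2 "images", and the threshold rule are all assumptions made for illustration; Datalab's actual scoring method is not reproduced here.

```python
def mean_brightness(image):
    """Average pixel value of a grayscale image given as a list of rows."""
    pixels = [p for row in image for p in row]
    return sum(pixels) / len(pixels)


def best_threshold_accuracy(values, labels):
    """Best accuracy of the rule 'predict 1 if value > t' (or its inverse).

    A result near 1.0 means the property alone almost determines the label,
    which hints at a spurious correlation. Assumes binary 0/1 labels.
    """
    best = 0.0
    for t in sorted(set(values)):
        preds = [1 if v > t else 0 for v in values]
        acc = sum(p == y for p, y in zip(preds, labels)) / len(labels)
        best = max(best, acc, 1.0 - acc)  # 1 - acc covers the inverted rule
    return best


# Toy data: every label-1 image is bright, every label-0 image is dark.
images = [
    [[0.9, 0.9], [0.9, 0.9]],
    [[0.8, 0.8], [0.8, 0.8]],
    [[0.1, 0.1], [0.1, 0.1]],
    [[0.2, 0.2], [0.2, 0.2]],
]
labels = [1, 1, 0, 0]

brightness = [mean_brightness(img) for img in images]
print(best_threshold_accuracy(brightness, labels))
```

Here brightness separates the two classes perfectly, so a model could score well by learning "bright means class 1" without ever looking at image content.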

New Tutorial: Improving ML Performance with Train and Test Set Curation

We've introduced a new tutorial that demonstrates how to carefully use cleanlab (via Datalab) for both training and test data. This approach helps ensure reliable ML model training and evaluation, particularly for noisy datasets.

You can find this tutorial in our documentation: Improving ML Performance via Data Curation with Train vs Test Splits.

Other Major Improvements

  • Optimized Internal Functions: Several internal functions have been optimized, including clip_noise_rates, remove_noise_from_class, and clip_values, improving the overall efficiency of cleanlab.
  • Improved Underperforming Group Detection: Enhanced scoring for all underperforming groups, providing more accurate identification of problematic data subsets.

If you have ideas for new features or notice any bugs, we encourage you to open an Issue or Pull Request on our GitHub repository!

Change Log

For a full list of changes, enhancements, and fixes, please refer to the Full Changelog.

v2.6.6

25 Jun 23:10
e604611

Full Changelog: v2.6.5...v2.6.6

v2.6.5

24 May 23:38
75a9d1c

What's Changed

  • Add end-to-end tests at the end of Datalab quickstart tutorial by @allincowell in #1118
  • Centralize existing functionality for constructing and correcting knn graphs in a separate module by @elisno in #1117, #1119, #1129
  • Optimize multiannotator.py for performance by @gogetron in #1077
  • Optimize value_counts function for performance improvement with missing classes by @gogetron in #1073
  • Improve test coverage for setting confident joint in CleanLearning by @elisno in #1123
  • Switch from np.isnan to pd.isna for null value check by @gogetron in #1096
  • Update pip install instruction in object detection tutorial by @elisno in #1126
  • Refine handling of underperforming_group issue type by @gogetron in #1099
  • Improve compatibility with sklearn 1.5 by removing the deprecated multi_class argument in LogisticRegression by @elisno in #1124
  • Display exact duplicate sets dynamically in tabular tutorial by @nelsonauner in #1128

Full Changelog: v2.6.4...v2.6.5

v2.6.4

07 May 18:23
81af417

Full Changelog: v2.6.3...v2.6.4

v2.6.3 - Enhanced scores for outliers and near-duplicates

19 Mar 22:08
b66a959

This release is non-breaking when upgrading from v2.6.2.

What's Changed

  • Updated image_key documentation by @sanjanag in #1048
  • Refine Scoring and Enhance Stability for Datasets with Identical Examples by @elisno in #1056
  • Add warning message about TensorFlow compatibility to docs by @elisno in #1057

Full Changelog: v2.6.2...v2.6.3

v2.6.2

08 Mar 16:18
e425448

This release is non-breaking when upgrading from v2.6.1.

What's Changed

  • Convert DataFrame features to numpy arrays in null value check by @elisno in #1045

Full Changelog: v2.6.1...v2.6.2

v2.6.1 -- Refined Regression Score and Fixes

07 Mar 14:02
6a98114

This release is non-breaking when upgrading from v2.6.0. Some noteworthy updates include:

  1. The label quality score in the cleanlab.regression module is improved to be more human-readable.
    • This only involves rescaling the scores to display a more human-interpretable range of scores, without affecting how your data points are ranked within a dataset according to these scores.
  2. Better address some edge-cases in Datalab.get_issues().
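The claim that rescaling leaves rankings untouched holds for any strictly increasing transformation. The sketch below checks this on made-up scores; the fourth-root rescaling is a hypothetical stand-in and not the formula used in cleanlab.regression.

```python
def ranking(scores):
    """Indices of the data points ordered from lowest (worst) to highest score."""
    return sorted(range(len(scores)), key=lambda i: scores[i])


# Made-up raw scores that are bunched near zero and hard to read.
raw_scores = [0.001, 0.02, 0.0004, 0.3]

# Hypothetical strictly increasing rescaling that spreads the values out.
rescaled_scores = [s ** 0.25 for s in raw_scores]

print(ranking(raw_scores))
print(ranking(rescaled_scores))
```

Both calls print the same ordering, so any workflow that reviews the lowest-scoring examples first is unaffected by the change in scale.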

Full Changelog: v2.6.0...v2.6.1

v2.6.0 -- Elevating Data Insights: Comprehensive Issue Checks & Expanded ML Task Compatibility

16 Feb 06:21
3f07a88

This release is non-breaking when upgrading from v2.5.0, continuing our commitment to maintaining backward compatibility while introducing new features and improvements.
However, this release drops support for Python 3.7 while adding support for Python 3.11.

Enhancements to Datalab

In this update, Datalab, our dataset analysis platform, enhances its ability to identify various types of issues within your datasets. With this release, Datalab now detects additional types of issues by default, offering users a more comprehensive analysis. Specifically, it can now:

  • Identify null values in your dataset.
  • Detect class_imbalance.
  • Highlight an underperforming_group, which refers to a subset of data points where your model exhibits poorer performance compared to others.
    See our FAQ for more information on how to provide pre-defined groups for this issue type.

Additionally, Datalab can now optionally:

  • Assess the value of data points in your dataset using KNN-Shapley scores as a measure of data_valuation.
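For readers unfamiliar with KNN-Shapley, the sketch below implements the textbook closed-form recursion for an unweighted KNN classifier and a single test point, assuming 1 <= k <= number of training points. The function name is hypothetical, and this is not Datalab's code path: Datalab works from features or a knn_graph and may aggregate or rescale the values differently.

```python
import math


def knn_shapley_single_test(train_points, train_labels, test_point, test_label, k):
    """Shapley value of each training point for one test point (unweighted KNN).

    Illustrative sketch of the standard recursion; assumes 1 <= k <= n.
    """
    n = len(train_points)
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= number of training points")

    # Sort training indices from nearest to farthest from the test point.
    order = sorted(range(n), key=lambda i: math.dist(train_points[i], test_point))
    match = [1.0 if train_labels[i] == test_label else 0.0 for i in order]

    # Start from the farthest point and walk back towards the nearest one.
    values_sorted = [0.0] * n
    values_sorted[n - 1] = match[n - 1] / n
    for pos in range(n - 2, -1, -1):
        rank = pos + 1  # 1-based rank of this point by distance
        values_sorted[pos] = (
            values_sorted[pos + 1]
            + (match[pos] - match[pos + 1]) / k * min(k, rank) / rank
        )

    # Map the values back to the original training order.
    values = [0.0] * n
    for pos, i in enumerate(order):
        values[i] = values_sorted[pos]
    return values


train_points = [(0.0,), (1.0,), (5.0,)]
train_labels = [1, 1, 0]
values = knn_shapley_single_test(train_points, train_labels, (0.2,), 1, k=1)
print(values)
```

In this toy case the two nearby points with the correct label share the credit, while the distant point with the wrong label contributes nothing. Low-valued points are candidates for closer inspection.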

If you have ideas for new features or notice any bugs, we encourage you to open an Issue or Pull Request on our GitHub repository!

Expanded Datalab Support for New ML Tasks

With cleanlab v2.6.0, Datalab extends its support to new machine-learning tasks and introduces enhancements across the board.
This release introduces the task parameter in Datalab's API, enabling users to specify the type of machine learning task they are working on.

from cleanlab import Datalab

lab = Datalab(..., task="regression")

The tasks currently supported are:

  • classification (default): Includes all previously supported issue-checking capabilities based on pred_probs, features, or a knn_graph, and the new features introduced earlier.
  • regression (new):
    • Run specialized label error detection algorithms on regression datasets. You can see this in action in our updated regression tutorial.
    • Find other issues utilizing features or a knn_graph.
  • multilabel (new):
    • Detect label errors in multilabel classification datasets using pred_probs exclusively. Explore the updated capabilities in our multilabel tutorial.
    • Find various other types of issues based on features or a knn_graph.

Improved Object Detection Dataset Exploration

New functions have been introduced to enhance the exploration of object detection datasets, simplifying data comprehension and issue detection.
Learn how to leverage some of these functions in our object detection tutorial.

Other Major Improvements

  • Rescaled Near Duplicate and Outlier Scores:
    • Note that what matters for all cleanlab issue scores is not their absolute magnitudes but rather how these scores rank the data points from most to least severe instances of the issue. But based on user feedback, we have updated the near duplicate and outlier scores to display a more human-interpretable range of values. How these scores rank data points within a dataset remains unchanged.
  • Consistency in counting label issues:
    • cleanlab.dataset.health_summary() now returns the same number of issues as cleanlab.classification.find_label_issues() and cleanlab.count.num_label_issues().
  • Improved handling of non-iid issues:
    • The non-iid issue check in Datalab now handles pred_probs as input.
  • Better reporting in Datalab:
    • Simplified Datalab.report() now highlights only detected issue types. To view all checked issue types, use Datalab.report(show_all_issues=True).
  • Enhanced Handling of Binary Classification Tasks:
    • Examples with predicted probabilities close to 0.5 for both classes are no longer flagged as label errors, improving the handling of binary classification tasks.
  • Experimental Functionality:
    • cleanlab now offers experimental functionality for detecting label issues in span categorization tasks with a single class, enhancing its applicability in natural language processing projects.

New Contributors

We're thrilled to welcome new contributors to the cleanlab community! Your contributions help us improve and grow cleanlab.

Thank you for your valuable contributions! If you're interested in contributing, check out our contributing guide for ways to get involved.

Change Log

Significant changes in this release include:

v2.5.0 -- All major ML tasks now supported

11 Sep 14:44
d45537e

This release is non-breaking when upgrading from v2.4.0 (except for certain methods in cleanlab.experimental that have been moved, especially utility methods related to Datalab).

New ML tasks supported

Cleanlab now supports all of the most common ML tasks! This newest release adds dedicated support for the following types of datasets:

  • regression (finding errors in numeric data): see cleanlab.regression and the "noisy labels in regression" quickstart tutorial.
  • object detection: see cleanlab.object_detection and the "Object Detection" quickstart tutorial.
  • image segmentation: see cleanlab.segmentation and the "Semantic Segmentation" tutorial.

Cleanlab previously already supported: multi-class classification, multi-label classification (image/document tagging), token classification (entity recognition, sequence prediction).

If there is another ML task you'd like to see this package support, please let us know (or even better open a Pull Request)!

Supporting these ML tasks properly required significant research and novel algorithms developed by our scientists. We have published papers on these for transparency and scientific rigor; check out the list in the README or learn more at:
https://cleanlab.ai/research/
https://cleanlab.ai/blog/

Improvements to Datalab

Datalab is a general platform for detecting all sorts of common issues in real-world data, and the best place to get started for running this library on your datasets.

This release introduces major improvements and new functionalities in Datalab that include the ability to:

  • Detect low-quality images in computer vision data (blurry, over/under-exposed, low-information, ...) via the integration of CleanVision.
  • Detect label issues even without pred_probs from an ML model (you can instead just provide features).
  • Flag rare classes in imbalanced classification datasets.
  • Audit unlabeled datasets.
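As a small illustration of the rare-class check, the sketch below reports classes whose share of the dataset falls below a cutoff. The function name and the 0.1 threshold are assumptions chosen for the example; Datalab's own criterion and defaults may differ.

```python
from collections import Counter


def rare_classes(labels, threshold=0.1):
    """Map each class whose fraction of the dataset is below `threshold` to that fraction.

    Hypothetical helper; the threshold is an illustrative choice.
    """
    counts = Counter(labels)
    n = len(labels)
    return {cls: cnt / n for cls, cnt in counts.items() if cnt / n < threshold}


labels = ["cat"] * 95 + ["dog"] * 5
print(rare_classes(labels))
```

With 5 of 100 examples, "dog" is reported as rare, which signals that metrics for that class rest on very little data.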

Other major improvements

  • 50x speedup in the cleanlab.multiannotator code for analyzing data labeled by multiple annotators.
  • Out-of-Distribution detection based on pred_probs via the GEN algorithm which is particularly effective for datasets with tons of classes.
  • Many of the methods across the package to find label issues now support a low_memory option. When specified, it uses an approximate mini-batching algorithm that returns results much faster and requires much less RAM.
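To give a feel for how an out-of-distribution score can be derived from pred_probs alone, here is a sketch of a generalized-entropy score in the style of GEN. The formula, the defaults gamma=0.1 and top_m=100, and the function name are stated as assumptions based on our reading of the GEN method; cleanlab's implementation may normalize or rescale the result differently.

```python
def gen_score(pred_probs, gamma=0.1, top_m=100):
    """Negative generalized entropy over the `top_m` largest class probabilities.

    Higher (closer to 0) means a more confident, in-distribution-looking
    prediction; more negative means a flatter, more suspicious one.
    Assumes every probability lies in [0, 1].
    """
    top = sorted(pred_probs, reverse=True)[:top_m]
    return -sum((p ** gamma) * ((1.0 - p) ** gamma) for p in top)


confident = gen_score([0.98, 0.01, 0.01])      # peaked prediction
uncertain = gen_score([1 / 3, 1 / 3, 1 / 3])   # flat prediction
print(confident > uncertain)
```

Restricting the sum to the largest probabilities keeps the long tail of near-zero classes from dominating the score, which is one reason this style of score suits datasets with many classes.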

New Contributors

Transforming cleanlab into the first universal data-centric AI platform is a major effort and we need your help! Many easy ways to contribute are listed on our GitHub, or you can jump into the discussions on Slack. We immensely appreciate all of the contributors who've helped build this package into what it is today, especially:

Change Log

Full Changelog: v2.4.0...v2.5.0