Releases · pemistahl/lingua-py

Lingua 2.1.0

Features

This release introduces an absolute confidence metric based on unique and most common ngrams for each supported language. This makes it possible to build a language detector from a single language only. Such a detector serves as a binary classifier, telling you whether some text is written in your selected language or not. (#235)
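A minimal sketch of the binary-classifier use case described above. It assumes the `lingua-language-detector` package in a version that includes this feature, and it assumes that a detector built from one language returns `None` from `detect_language_of()` when the text is not recognized as that language. The helper names `build_single_language_detector` and `is_in_selected_language` are hypothetical and not part of the library.

```python
def build_single_language_detector():
    """Build a detector from English only.

    The import is local so that the helper below can be used and tested
    with any object that offers detect_language_of().
    """
    from lingua import Language, LanguageDetectorBuilder

    return LanguageDetectorBuilder.from_languages(Language.ENGLISH).build()


def is_in_selected_language(detector, text: str) -> bool:
    """Treat the single-language detector as a yes/no classifier.

    Assumption: the detector returns None when the text is not
    recognized as its one configured language.
    """
    return detector.detect_language_of(text) is not None
```

With the library installed, `is_in_selected_language(build_single_language_detector(), "languages are awesome")` would be the typical call.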

Improvements

The new absolute confidence metric improves accuracy in low accuracy mode. The mean of the average detection accuracies (single words, word pairs and sentences combined) increases from 77% to 80%.
The rule-based algorithm for the recognition of Japanese texts has been improved. Texts including both Japanese and Chinese characters are now more often correctly classified as Japanese instead of Chinese.
The characters Щщ are now correctly identified as possible indicators for the Ukrainian language, leading to slightly higher accuracy when identifying Ukrainian texts.
The enums provided by this library can now be copied and pickled. (#199)
Members of the enums provided by this library can now be created dynamically with the function from_str(). (#225)
The library can now be used with Azure Artifacts. (#209)

Bug Fixes

Text spans created by LanguageDetector.detect_multiple_languages_of() sometimes skipped characters in the last span. This has been fixed.
The tokenization of texts written in the Devanagari alphabet was flawed. This has been fixed, leading to better detection accuracy for Hindi and Marathi.
The classes provided by this library are now part of the correct lingua module instead of the builtins module. (#255)

Compatibility

Python 3.13 is now officially supported.
Support for Python 3.8 and 3.9 has been dropped. The lowest supported Python version is now 3.10.

Lingua 1.4.1

Improvements

The rule-based algorithm for the recognition of Japanese texts has been improved. Texts including both Japanese and Chinese characters are now more often correctly classified as Japanese instead of Chinese.
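To illustrate the kind of rule involved, here is a simplified sketch and not the library's actual algorithm. It relies on the fact that Hiragana and Katakana are specific to Japanese, while Han characters are shared with Chinese. The function names and the returned labels are hypothetical.

```python
def contains_kana(text: str) -> bool:
    """True if the text has Hiragana (U+3040-U+309F) or Katakana (U+30A0-U+30FF)."""
    return any(0x3040 <= ord(ch) <= 0x30FF for ch in text)


def contains_han(text: str) -> bool:
    """True if the text has characters from the basic CJK Unified Ideographs block."""
    return any(0x4E00 <= ord(ch) <= 0x9FFF for ch in text)


def classify_cjk(text: str):
    """Prefer Japanese whenever kana occur, even if Han characters dominate."""
    if contains_kana(text):
        return "JAPANESE"
    if contains_han(text):
        return "CHINESE"
    return None
```

A text such as "東京は大きい" mixes Han characters with Hiragana, so the kana check decides in favor of Japanese before the Han check is reached.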

Bug Fixes

Text spans created by LanguageDetector.detect_multiple_languages_of() sometimes skipped characters in the last span. This has been fixed. (#247)

Please note: All improvements and bug fixes will also be part of the next Rust-based Python extension release 2.1.0.

Lingua 1.4.0

Features

This release introduces an absolute confidence metric based on unique and most common ngrams for each supported language. This makes it possible to build a language detector from a single language only. Such a detector serves as a binary classifier, telling you whether some text is written in your selected language or not. (#235)

Improvements

The new absolute confidence metric improves accuracy in low accuracy mode. The mean of the average detection accuracies (single words, word pairs and sentences combined) increases from 77% to 80%.

Bug Fixes

The tokenization of texts written in the Devanagari alphabet was flawed. This has been fixed, leading to better detection accuracy for Hindi and Marathi.

Compatibility

Python 3.13 is now officially supported.
Support for Python 3.8 and 3.9 has been dropped. The lowest supported Python version is now 3.10.

Please note: All new features and bug fixes will also be part of the next Rust-based Python extension release 2.1.0.

Lingua 1.3.5

Improvements

The language models are now stored in dictionaries instead of NumPy arrays. This significantly improves runtime performance at the cost of higher memory consumption (up to 3 GB for all models). Runtime performance was much too slow with the former approach and additional memory is comparatively cheap, so the trade-off is worthwhile.
The language model files are now compressed with the Brotli algorithm, which reduces the file size by 15% on average.
The characters Щщ are now correctly identified as possible indicators for the Ukrainian language, leading to slightly higher accuracy when identifying Ukrainian texts.
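The effect of the Щщ fix can be sketched with a small character-hint table. This is an illustration under stated assumptions, not the library's real rule set: the table `CHAR_HINTS` is hypothetical and deliberately incomplete, and the language names are plain strings rather than library enums.

```python
# Hypothetical, incomplete table: characters -> languages they hint at.
CHAR_HINTS = {
    "Щщ": {"BULGARIAN", "RUSSIAN", "UKRAINIAN"},
    "Її": {"UKRAINIAN"},
    "Ыы": {"BELARUSIAN", "RUSSIAN"},
}


def candidate_languages(word: str, all_languages: set) -> set:
    """Narrow the candidate set using every hint whose characters occur in the word."""
    candidates = set(all_languages)
    for chars, languages in CHAR_HINTS.items():
        if any(ch in chars for ch in word):
            candidates &= languages
    return candidates
```

If Ukrainian were missing from the Щщ entry, any word containing that letter would drop Ukrainian from the candidates before the statistical model runs, which is the kind of error the fix addresses.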

Miscellaneous

All dependencies have been updated to their latest versions.

Lingua 2.0.2

Improvements

Type stubs for the Python bindings are now available, allowing better static code analysis, better code completion in supported IDEs and easier understanding of the library's API. (#197)

Bug Fixes

The method LanguageDetector.detect_multiple_languages_of still returned character indices instead of byte indices when only a single DetectionResult was produced. This has been fixed. (#203, #205)

Please note: Due to project size limits on PyPI, the Python wheels for previous version 2.0.1 had to be deleted. Please use 2.0.2 instead.

Lingua 2.0.1

Contributors

@boltonn

Bug Fixes

The method LanguageDetector.detect_multiple_languages_of returns byte indices. For creating string slices in Python, character indices are needed but were not provided. This resulted in incorrect DetectionResults for Python. This has been fixed now by converting the byte indices to character indices. Big thanks to @boltonn for the bug report. (#192)
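The difference between the two kinds of indices matters because Rust strings are UTF-8 and indexed by byte, while Python strings are indexed by character. A minimal sketch of such a conversion follows. It is not the library's implementation, and the function name `byte_to_char_index` is hypothetical.

```python
def byte_to_char_index(text: str, byte_index: int) -> int:
    """Convert a UTF-8 byte offset into a character offset for `text`.

    Raises ValueError if the offset is out of range or falls inside
    a multi-byte character.
    """
    encoded = text.encode("utf-8")
    if not 0 <= byte_index <= len(encoded):
        raise ValueError(f"byte index {byte_index} is out of range")
    # Decoding the prefix counts how many characters the bytes cover.
    # UnicodeDecodeError is a subclass of ValueError.
    return len(encoded[:byte_index].decode("utf-8"))
```

For the text "Grüße, world", the byte offset 7 corresponds to the character offset 5, because "ü" and "ß" each occupy two bytes in UTF-8. Slicing the Python string with the byte offset would include two extra characters.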

Please note: Due to project size limits on PyPI, the Python wheels for previous version 2.0.0 had to be deleted. Please use 2.0.1 instead.

Lingua 2.0.0

Features

Python bindings for the Rust implementation of Lingua have now replaced the pure Python implementation in order to benefit from Rust's performance in any Python software.
Parallel equivalents for all methods in LanguageDetector have been added to give the user the choice of using the library single-threaded or multi-threaded.

Lingua 1.3.4

Miscellaneous

This release resolves some dependency issues so that the latest versions of the dependencies NumPy, Pandas and Matplotlib can be used with Python >= 3.9 while older versions are used with Python 3.8.
All dependencies have been updated to their latest versions.

Lingua 1.3.3

Improvements

Processing the language models is now a little faster thanks to binary search on the language model NumPy arrays.
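The idea can be sketched with the standard library's `bisect` module, with plain sorted lists standing in for the NumPy arrays (for real NumPy arrays, `numpy.searchsorted` performs the equivalent binary search). The layout of two parallel sequences and the function name `lookup_probability` are assumptions made for illustration only.

```python
from bisect import bisect_left


def lookup_probability(sorted_ngrams, probabilities, ngram):
    """Find an ngram's probability in O(log n) instead of scanning linearly.

    `sorted_ngrams` must be sorted ascending, and `probabilities[i]`
    must belong to `sorted_ngrams[i]`.
    """
    i = bisect_left(sorted_ngrams, ngram)
    if i < len(sorted_ngrams) and sorted_ngrams[i] == ngram:
        return probabilities[i]
    return None  # ngram is not part of the model
```

The bounds check after `bisect_left` is needed because the function returns an insertion point, which may equal the length of the list or point at a different ngram.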

Bug Fixes

Several bugs in multiple languages detection have been fixed that caused incomplete results to be returned in several cases. (#143, #154)
A significant number of Kazakh texts were incorrectly classified as Mongolian. This has been fixed. (#160)

Miscellaneous

A new section on performance tips has been added to the README.
All dependencies have been updated to their latest versions.

Lingua 1.3.2

Improvements

After some internal optimizations, language detection is now approximately 20% to 30% faster. The speed improvement is greater for long input texts than for short ones.

Releases: pemistahl/lingua-py

Lingua 2.1.0: Features, Improvements, Bug Fixes, Compatibility
Lingua 1.4.1: Improvements, Bug Fixes
Lingua 1.4.0: Features, Improvements, Bug Fixes, Compatibility
Lingua 1.3.5: Improvements, Miscellaneous
Lingua 2.0.2: Improvements, Bug Fixes
Lingua 2.0.1: Bug Fixes, Contributors
Lingua 2.0.0: Features
Lingua 1.3.4: Miscellaneous
Lingua 1.3.3: Improvements, Bug Fixes, Miscellaneous
Lingua 1.3.2: Improvements