-
Notifications
You must be signed in to change notification settings - Fork 504
Fix tokenizer length issue with DJL upgrade #2536
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, this looks good to me
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #2536 +/- ##
============================================
+ Coverage 67.14% 67.18% +0.03%
- Complexity 1479 1481 +2
============================================
Files 219 219
Lines 12641 12645 +4
Branches 1528 1528
============================================
+ Hits 8488 8495 +7
+ Misses 3625 3624 -1
+ Partials 528 526 -2 ☔ View full report in Codecov by Sentry. |
This works, alternatively we could just set the truncation flag to "DO NOT TRUNCATE" options.put("truncation", "DO_NOT_TRUNCATE"); |
I've confirmed that all these affected regressions pass: python src/main/python/run_regression.py --index --verify --search --regression miracl-v1.0-ar-aca > logs/log.miracl-v1.0-ar-aca.txt 2>&1
python src/main/python/run_regression.py --index --verify --search --regression miracl-v1.0-bn-aca > logs/log.miracl-v1.0-bn-aca.txt 2>&1
python src/main/python/run_regression.py --index --verify --search --regression miracl-v1.0-en-aca > logs/log.miracl-v1.0-en-aca.txt 2>&1
python src/main/python/run_regression.py --index --verify --search --regression miracl-v1.0-es-aca > logs/log.miracl-v1.0-es-aca.txt 2>&1
python src/main/python/run_regression.py --index --verify --search --regression miracl-v1.0-fa-aca > logs/log.miracl-v1.0-fa-aca.txt 2>&1
python src/main/python/run_regression.py --index --verify --search --regression miracl-v1.0-fi-aca > logs/log.miracl-v1.0-fi-aca.txt 2>&1
python src/main/python/run_regression.py --index --verify --search --regression miracl-v1.0-fr-aca > logs/log.miracl-v1.0-fr-aca.txt 2>&1
python src/main/python/run_regression.py --index --verify --search --regression miracl-v1.0-hi-aca > logs/log.miracl-v1.0-hi-aca.txt 2>&1
python src/main/python/run_regression.py --index --verify --search --regression miracl-v1.0-id-aca > logs/log.miracl-v1.0-id-aca.txt 2>&1
python src/main/python/run_regression.py --index --verify --search --regression miracl-v1.0-ja-aca > logs/log.miracl-v1.0-ja-aca.txt 2>&1
python src/main/python/run_regression.py --index --verify --search --regression miracl-v1.0-ko-aca > logs/log.miracl-v1.0-ko-aca.txt 2>&1
python src/main/python/run_regression.py --index --verify --search --regression miracl-v1.0-ru-aca > logs/log.miracl-v1.0-ru-aca.txt 2>&1
python src/main/python/run_regression.py --index --verify --search --regression miracl-v1.0-sw-aca > logs/log.miracl-v1.0-sw-aca.txt 2>&1
python src/main/python/run_regression.py --index --verify --search --regression miracl-v1.0-te-aca > logs/log.miracl-v1.0-te-aca.txt 2>&1
python src/main/python/run_regression.py --index --verify --search --regression miracl-v1.0-th-aca > logs/log.miracl-v1.0-th-aca.txt 2>&1
python src/main/python/run_regression.py --index --verify --search --regression miracl-v1.0-zh-aca > logs/log.miracl-v1.0-zh-aca.txt 2>&1
python src/main/python/run_regression.py --index --verify --search --regression mrtydi-v1.1-ar-aca > logs/log.mrtydi-v1.1-ar-aca.txt 2>&1
python src/main/python/run_regression.py --index --verify --search --regression mrtydi-v1.1-bn-aca > logs/log.mrtydi-v1.1-bn-aca.txt 2>&1
python src/main/python/run_regression.py --index --verify --search --regression mrtydi-v1.1-en-aca > logs/log.mrtydi-v1.1-en-aca.txt 2>&1
python src/main/python/run_regression.py --index --verify --search --regression mrtydi-v1.1-fi-aca > logs/log.mrtydi-v1.1-fi-aca.txt 2>&1
python src/main/python/run_regression.py --index --verify --search --regression mrtydi-v1.1-id-aca > logs/log.mrtydi-v1.1-id-aca.txt 2>&1
python src/main/python/run_regression.py --index --verify --search --regression mrtydi-v1.1-ja-aca > logs/log.mrtydi-v1.1-ja-aca.txt 2>&1
python src/main/python/run_regression.py --index --verify --search --regression mrtydi-v1.1-ko-aca > logs/log.mrtydi-v1.1-ko-aca.txt 2>&1
python src/main/python/run_regression.py --index --verify --search --regression mrtydi-v1.1-ru-aca > logs/log.mrtydi-v1.1-ru-aca.txt 2>&1
python src/main/python/run_regression.py --index --verify --search --regression mrtydi-v1.1-sw-aca > logs/log.mrtydi-v1.1-sw-aca.txt 2>&1
python src/main/python/run_regression.py --index --verify --search --regression mrtydi-v1.1-te-aca > logs/log.mrtydi-v1.1-te-aca.txt 2>&1
python src/main/python/run_regression.py --index --verify --search --regression mrtydi-v1.1-th-aca > logs/log.mrtydi-v1.1-th-aca.txt 2>&1
python src/main/python/run_regression.py --index --verify --search --regression msmarco-v1-passage.wp-ca > logs/log.msmarco-v1-passage.wp-ca.txt 2>&1
python src/main/python/run_regression.py --index --verify --search --regression msmarco-v1-doc.wp-ca > logs/log.msmarco-v1-doc.wp-ca.txt 2>&1
python src/main/python/run_regression.py --index --verify --search --regression msmarco-v1-doc-segmented.wp-ca > logs/log.msmarco-v1-doc-segmented.wp-ca.txt 2>&1
python src/main/python/run_regression.py --index --verify --search --regression dl19-passage.wp-ca > logs/log.dl19-passage.wp-ca.txt 2>&1
python src/main/python/run_regression.py --index --verify --search --regression dl19-doc.wp-ca > logs/log.dl19-doc.wp-ca.txt 2>&1
python src/main/python/run_regression.py --index --verify --search --regression dl19-doc-segmented.wp-ca > logs/log.dl19-doc-segmented.wp-ca.txt 2>&1
python src/main/python/run_regression.py --index --verify --search --regression dl20-passage.wp-ca > logs/log.dl20-passage.wp-ca.txt 2>&1
python src/main/python/run_regression.py --index --verify --search --regression dl20-doc.wp-ca > logs/log.dl20-doc.wp-ca.txt 2>&1
python src/main/python/run_regression.py --index --verify --search --regression dl20-doc-segmented.wp-ca > logs/log.dl20-doc-segmented.wp-ca.txt 2>&1
python src/main/python/run_regression.py --index --verify --search --regression msmarco-v1-passage.wp-hgf > logs/log.msmarco-v1-passage.wp-hgf.txt 2>&1
python src/main/python/run_regression.py --index --verify --search --regression msmarco-v1-doc.wp-hgf > logs/log.msmarco-v1-doc.wp-hgf.txt 2>&1
python src/main/python/run_regression.py --index --verify --search --regression dl19-passage.wp-hgf > logs/log.dl19-passage.wp-hgf.txt 2>&1
python src/main/python/run_regression.py --index --verify --search --regression dl19-doc.wp-hgf > logs/log.dl19-doc.wp-hgf.txt 2>&1
python src/main/python/run_regression.py --index --verify --search --regression dl20-passage.wp-hgf > logs/log.dl20-passage.wp-hgf.txt 2>&1
python src/main/python/run_regression.py --index --verify --search --regression dl20-doc.wp-hgf > logs/log.dl20-doc.wp-hgf.txt 2>&1 |
Override default settings to be able to tokenize arbitrarily long sequences.
Another take for #2535
After digging through source code:
https://github.com/deepjavalibrary/djl/blob/master/extensions/tokenizers/src/main/java/ai/djl/huggingface/tokenizers/HuggingFaceTokenizer.java
I was able to override the tokenize arbitrarily long sequences.