No such file or directory: '../data/processed/tokenize/ta_ttb-ud-dev-mwt.json' #571
The training & dev data needs to be preprocessed with the prep_tokenize.sh
script first. Did you do that?
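For reference, and assuming the script takes the same treebank-name argument as run_tokenize.sh plus the train/dev/test split it asks for (the exact argument order here is a guess), the invocation would look something like:
bash scripts/prep_tokenize.sh UD_Tamil-TTB train
with the dev and test splits prepared the same way.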
Hi,
Now I get this issue after the tokenisation, but I do not see any issue with the data set now:
Step 20000/20000 Loss: 0.002
Also I get the following when I use run_mwt.sh without any additional parameters:
bash scripts/run_mwt.sh UD_Tamil-TTB
Btw, my intention is to train a multi-word expander. Therefore, I prepared the data mostly with mwt tokens; a sample is given below. Hope it would not be an issue.
1-2 மனைவியும் _ _ _ _ _ _ _ _
1-2 மனதையும் _ _ _ _ _ _ _ _
Thank you in advance,
Sarves
This seems to be in three parts. Let me try to answer each:
Hi, Thank you. I think it's good to include [prep_tokenize] in the
documentation; sorry if I have missed it.
Btw, for prep_tokenize we need to specify whether we are preprocessing the
test / dev / train set, but prep_mwt.sh worked without it.
Oops, sorry about that. There's a bit more explanation on this page:
https://stanfordnlp.github.io/stanza/new_language.html
At any rate, I am busy overhauling the scripts to be in Python, and I'll
update the documentation when we make a new release that includes the new
scripts.
__main__.UDError: There is an empty FORM in the CoNLL-U file ta_ttb
--batch_size 32 --dropout 0.33
That is a little weird. I haven't seen that before. Assuming you have the
standard directory layout, would you let me know what is in
data/tokenize?
There should be a .gold and a .pred dev file in there after running the
tokenize script. What do those files look like?
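In case it helps with that inspection, here is a minimal sketch for spotting the offending lines; the filename check_form.py and the dev-file name in the usage comment are placeholders, not part of the stanza scripts:
# check_form.py (hypothetical helper): list CoNLL-U lines whose FORM column
# (field 2) is empty, which is what the "There is an empty FORM" UDError means.
import sys

with open(sys.argv[1]) as f:
    for lineno, line in enumerate(f, 1):
        fields = line.rstrip("\n").split("\t")
        # A well-formed CoNLL-U word line has exactly 10 tab-separated fields.
        if len(fields) == 10 and not fields[1].strip():
            print(f"line {lineno}: empty FORM")

# Usage: python check_form.py data/tokenize/<your .gold or .pred dev file>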
Therefore, I prepared the data mostly with mwt tokens, sample is given
below. Hope it would not be an issue.
This may be the same error as the previous error, and looking at the sample
data you have included, I think I understand the problem. The conllu eval
script expects some specific formatting in the data file. In particular,
it expects dependencies, even if you aren't working with dependencies. The
easiest way to fake that is to make word 1 the root (point to 0), then each
subsequent word N points to word N-1.
1-2 மனதையும் _ _ _ _ _ _ _ _
1 மனதை _ _ _ _ 0 root 0:root _
2 உம் _ _ _ _ 1 dep 1:dep _
etc etc
Actually, this may solve both problems #2 and #3.
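For anyone who lands here later, a minimal sketch of that fake-dependency fix, assuming a plain CoNLL-U input; fake_deps.py is a hypothetical name, not one of the stanza scripts:
# fake_deps.py (hypothetical helper): fill HEAD/DEPREL/DEPS with a flat chain
# so the conllu eval script accepts a file that has no real dependencies.
# Word 1 points to 0 as root; each subsequent word N points to word N-1.
import sys

def fake_deps(line):
    fields = line.rstrip("\n").split("\t")
    # Leave comments, blank lines, and MWT ranges like "1-2" untouched.
    if len(fields) != 10 or not fields[0].isdigit():
        return line.rstrip("\n")
    head = int(fields[0]) - 1
    deprel = "root" if head == 0 else "dep"
    fields[6] = str(head)            # HEAD
    fields[7] = deprel               # DEPREL
    fields[8] = f"{head}:{deprel}"   # DEPS, mirroring the example above
    return "\t".join(fields)

with open(sys.argv[1]) as f:
    for line in f:
        print(fake_deps(line))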
Hi, Thanks, it worked. Thanks!
Hi,
Why does the tokeniser script look for ta_ttb-ud-dev-mwt.json when I try to train a tokeniser? This has to be created during the training, isn't it?
Where can I find detailed documentation about the model training scripts and their parameters? (I looked at https://stanfordnlp.github.io/stanza/training.html#training-with-scripts)
Thank you
bash scripts/run_tokenize.sh UD_Tamil-TTB --batch_size 100 --dropout 0.33
Running tokenizer with --batch_size 100 --dropout 0.33...
Running tokenizer in train mode
Traceback (most recent call last):
File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/sarves/Stanza/stanza-train-master/stanza/stanza/models/tokenizer.py", line 183, in
main()
File "/home/sarves/Stanza/stanza-train-master/stanza/stanza/models/tokenizer.py", line 93, in main
train(args)
File "/home/sarves/Stanza/stanza-train-master/stanza/stanza/models/tokenizer.py", line 98, in train
mwt_dict = load_mwt_dict(args['mwt_json_file'])
File "/home/sarves/Stanza/stanza-train-master/stanza/stanza/models/tokenize/utils.py", line 16, in load_mwt_dict
with open(filename, 'r') as f:
FileNotFoundError: [Errno 2] No such file or directory: '../data/processed/tokenize/ta_ttb-ud-dev-mwt.json'
Running tokenizer in predict mode
Traceback (most recent call last):
File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/sarves/Stanza/stanza-train-master/stanza/stanza/models/tokenizer.py", line 183, in
main()
File "/home/sarves/Stanza/stanza-train-master/stanza/stanza/models/tokenizer.py", line 95, in main
evaluate(args)
File "/home/sarves/Stanza/stanza-train-master/stanza/stanza/models/tokenizer.py", line 161, in evaluate
mwt_dict = load_mwt_dict(args['mwt_json_file'])
File "/home/sarves/Stanza/stanza-train-master/stanza/stanza/models/tokenize/utils.py", line 16, in load_mwt_dict
with open(filename, 'r') as f:
FileNotFoundError: [Errno 2] No such file or directory: '../data/processed/tokenize/ta_ttb-ud-dev-mwt.json'
Traceback (most recent call last):
File "stanza/utils/conll18_ud_eval.py", line 532, in
main()
File "stanza/utils/conll18_ud_eval.py", line 500, in main
evaluation = evaluate_wrapper(args)
File "stanza/utils/conll18_ud_eval.py", line 482, in evaluate_wrapper
gold_ud = load_conllu_file(args.gold_file)
File "stanza/utils/conll18_ud_eval.py", line 478, in load_conllu_file
return load_conllu(_file)
File "stanza/utils/conll18_ud_eval.py", line 209, in load_conllu
process_word(word)
File "stanza/utils/conll18_ud_eval.py", line 205, in process_word
process_word(parent)
File "stanza/utils/conll18_ud_eval.py", line 205, in process_word
process_word(parent)
File "stanza/utils/conll18_ud_eval.py", line 205, in process_word
process_word(parent)
File "stanza/utils/conll18_ud_eval.py", line 197, in process_word
raise UDError("There is a cycle in a sentence")
__main__.UDError: There is a cycle in a sentence
ta_ttb --batch_size 100 --dropout 0.33