No such file or directory: '../data/processed/tokenize/ta_ttb-ud-dev-mwt.json' #571
The training & dev data needs to be preprocessed with the prep_tokenize.sh
script first. Did you do that?
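For reference, and assuming the script takes the same treebank-name argument as run_tokenize.sh plus the train/dev/test split it asks for (the exact argument order here is a guess), the invocation would look something like:
bash scripts/prep_tokenize.sh UD_Tamil-TTB train
with the dev and test splits prepared the same way.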
Hi,
Now I get this issue after the tokenisation, but I do not see any issue with the data set now:
Step 20000/20000 Loss: 0.002
Also I get the following when I use run_mwt.sh without any additional parameters:
bash scripts/run_mwt.sh UD_Tamil-TTB
Btw, my intention is to train a multi-word expander. Therefore, I prepared the data mostly with mwt tokens; a sample is given below. Hope it would not be an issue.
1-2 மனைவியும் _ _ _ _ _ _ _ _
1-2 மனதையும் _ _ _ _ _ _ _ _
Thank you in advance,
Sarves
This seems to be in three parts. Let me try to answer each:
Hi, Thank you. I think it's good to include [prep_tokenize] in the
documentation; sorry if I have missed it.
Btw, for prep_tokenize we need to specify whether we are preprocessing the
test / dev / train set, but prep_mwt.sh worked without it.
Oops, sorry about that. There's a bit more explanation on this page:
https://stanfordnlp.github.io/stanza/new_language.html
At any rate, I am busy overhauling the scripts to be in Python, and I'll
update the documentation when we make a new release that includes the new
scripts.
__main__.UDError: There is an empty FORM in the CoNLL-U file ta_ttb
--batch_size 32 --dropout 0.33
That is a little weird. I haven't seen that before. Assuming you have the
standard directory layout, would you let me know what is in
data/tokenize?
There should be a .gold and a .pred dev file in there after running the
tokenize script. What do those files look like?
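In case it helps with that inspection, here is a minimal sketch for spotting the offending lines; the filename check_form.py and the dev-file name in the usage comment are placeholders, not part of the stanza scripts:
# check_form.py (hypothetical helper): list CoNLL-U lines whose FORM column
# (field 2) is empty, which is what the "There is an empty FORM" UDError means.
import sys

with open(sys.argv[1]) as f:
    for lineno, line in enumerate(f, 1):
        fields = line.rstrip("\n").split("\t")
        # A well-formed CoNLL-U word line has exactly 10 tab-separated fields.
        if len(fields) == 10 and not fields[1].strip():
            print(f"line {lineno}: empty FORM")

# Usage: python check_form.py data/tokenize/<your .gold or .pred dev file>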
Therefore, I prepared the data mostly with mwt tokens, sample is given
below. Hope it would not be an issue.
This may be the same error as the previous error, and looking at the sample
data you have included, I think I understand the problem. The conllu eval
script expects some specific formatting in the data file. In particular,
it expects dependencies, even if you aren't working with dependencies. The
easiest way to fake that is to make word 1 the root (point to 0), then each
subsequent word N points to word N-1.
1-2 மனதையும் _ _ _ _ _ _ _ _
1 மனதை _ _ _ _ 0 root 0:root _
2 உம் _ _ _ _ 1 dep 1:dep _
etc etc
Actually, this may solve both problems #2 and #3.
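For anyone who lands here later, a minimal sketch of that fake-dependency fix, assuming a plain CoNLL-U input; fake_deps.py is a hypothetical name, not one of the stanza scripts:
# fake_deps.py (hypothetical helper): fill HEAD/DEPREL/DEPS with a flat chain
# so the conllu eval script accepts a file that has no real dependencies.
# Word 1 points to 0 as root; each subsequent word N points to word N-1.
import sys

def fake_deps(line):
    fields = line.rstrip("\n").split("\t")
    # Leave comments, blank lines, and MWT ranges like "1-2" untouched.
    if len(fields) != 10 or not fields[0].isdigit():
        return line.rstrip("\n")
    head = int(fields[0]) - 1
    deprel = "root" if head == 0 else "dep"
    fields[6] = str(head)            # HEAD
    fields[7] = deprel               # DEPREL
    fields[8] = f"{head}:{deprel}"   # DEPS, mirroring the example above
    return "\t".join(fields)

with open(sys.argv[1]) as f:
    for line in f:
        print(fake_deps(line))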
Hi, Thanks, it worked. Thanks!
Hi,
Why does the tokeniser script look for ta_ttb-ud-dev-mwt.json when I try to train a tokeniser? This has to be created during the training, isn't it?
Where can I find detailed documentation about the model training scripts and their parameters? (I looked at https://stanfordnlp.github.io/stanza/training.html#training-with-scripts)
Thank you
bash scripts/run_tokenize.sh UD_Tamil-TTB --batch_size 100 --dropout 0.33
Running tokenizer with --batch_size 100 --dropout 0.33...
Running tokenizer in train mode
Traceback (most recent call last):
File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/sarves/Stanza/stanza-train-master/stanza/stanza/models/tokenizer.py", line 183, in
main()
File "/home/sarves/Stanza/stanza-train-master/stanza/stanza/models/tokenizer.py", line 93, in main
train(args)
File "/home/sarves/Stanza/stanza-train-master/stanza/stanza/models/tokenizer.py", line 98, in train
mwt_dict = load_mwt_dict(args['mwt_json_file'])
File "/home/sarves/Stanza/stanza-train-master/stanza/stanza/models/tokenize/utils.py", line 16, in load_mwt_dict
with open(filename, 'r') as f:
FileNotFoundError: [Errno 2] No such file or directory: '../data/processed/tokenize/ta_ttb-ud-dev-mwt.json'
Running tokenizer in predict mode
Traceback (most recent call last):
File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/sarves/Stanza/stanza-train-master/stanza/stanza/models/tokenizer.py", line 183, in
main()
File "/home/sarves/Stanza/stanza-train-master/stanza/stanza/models/tokenizer.py", line 95, in main
evaluate(args)
File "/home/sarves/Stanza/stanza-train-master/stanza/stanza/models/tokenizer.py", line 161, in evaluate
mwt_dict = load_mwt_dict(args['mwt_json_file'])
File "/home/sarves/Stanza/stanza-train-master/stanza/stanza/models/tokenize/utils.py", line 16, in load_mwt_dict
with open(filename, 'r') as f:
FileNotFoundError: [Errno 2] No such file or directory: '../data/processed/tokenize/ta_ttb-ud-dev-mwt.json'
Traceback (most recent call last):
File "stanza/utils/conll18_ud_eval.py", line 532, in
main()
File "stanza/utils/conll18_ud_eval.py", line 500, in main
evaluation = evaluate_wrapper(args)
File "stanza/utils/conll18_ud_eval.py", line 482, in evaluate_wrapper
gold_ud = load_conllu_file(args.gold_file)
File "stanza/utils/conll18_ud_eval.py", line 478, in load_conllu_file
return load_conllu(_file)
File "stanza/utils/conll18_ud_eval.py", line 209, in load_conllu
process_word(word)
File "stanza/utils/conll18_ud_eval.py", line 205, in process_word
process_word(parent)
File "stanza/utils/conll18_ud_eval.py", line 205, in process_word
process_word(parent)
File "stanza/utils/conll18_ud_eval.py", line 205, in process_word
process_word(parent)
File "stanza/utils/conll18_ud_eval.py", line 197, in process_word
raise UDError("There is a cycle in a sentence")
__main__.UDError: There is a cycle in a sentence
ta_ttb --batch_size 100 --dropout 0.33