No such file or directory: '../data/processed/tokenize/ta_ttb-ud-dev-mwt.json' · Issue #571 · stanfordnlp/stanza · GitHub

No such file or directory: '../data/processed/tokenize/ta_ttb-ud-dev-mwt.json' #571


Closed
sarves opened this issue Dec 19, 2020 · 4 comments
sarves commented Dec 19, 2020

Hi,

Why does the tokeniser script look for ta_ttb-ud-dev-mwt.json when I try to train a tokeniser? Shouldn't this file be created during training?

Where can I find detailed documentation about the model training scripts and their parameters? (I have looked at https://stanfordnlp.github.io/stanza/training.html#training-with-scripts.)

Thank you
bash scripts/run_tokenize.sh UD_Tamil-TTB --batch_size 100 --dropout 0.33
Running tokenizer with --batch_size 100 --dropout 0.33...
Running tokenizer in train mode
Traceback (most recent call last):
File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/sarves/Stanza/stanza-train-master/stanza/stanza/models/tokenizer.py", line 183, in <module>
main()
File "/home/sarves/Stanza/stanza-train-master/stanza/stanza/models/tokenizer.py", line 93, in main
train(args)
File "/home/sarves/Stanza/stanza-train-master/stanza/stanza/models/tokenizer.py", line 98, in train
mwt_dict = load_mwt_dict(args['mwt_json_file'])
File "/home/sarves/Stanza/stanza-train-master/stanza/stanza/models/tokenize/utils.py", line 16, in load_mwt_dict
with open(filename, 'r') as f:
FileNotFoundError: [Errno 2] No such file or directory: '../data/processed/tokenize/ta_ttb-ud-dev-mwt.json'
Running tokenizer in predict mode
Traceback (most recent call last):
File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/sarves/Stanza/stanza-train-master/stanza/stanza/models/tokenizer.py", line 183, in <module>
main()
File "/home/sarves/Stanza/stanza-train-master/stanza/stanza/models/tokenizer.py", line 95, in main
evaluate(args)
File "/home/sarves/Stanza/stanza-train-master/stanza/stanza/models/tokenizer.py", line 161, in evaluate
mwt_dict = load_mwt_dict(args['mwt_json_file'])
File "/home/sarves/Stanza/stanza-train-master/stanza/stanza/models/tokenize/utils.py", line 16, in load_mwt_dict
with open(filename, 'r') as f:
FileNotFoundError: [Errno 2] No such file or directory: '../data/processed/tokenize/ta_ttb-ud-dev-mwt.json'
Traceback (most recent call last):
File "stanza/utils/conll18_ud_eval.py", line 532, in <module>
main()
File "stanza/utils/conll18_ud_eval.py", line 500, in main
evaluation = evaluate_wrapper(args)
File "stanza/utils/conll18_ud_eval.py", line 482, in evaluate_wrapper
gold_ud = load_conllu_file(args.gold_file)
File "stanza/utils/conll18_ud_eval.py", line 478, in load_conllu_file
return load_conllu(_file)
File "stanza/utils/conll18_ud_eval.py", line 209, in load_conllu
process_word(word)
File "stanza/utils/conll18_ud_eval.py", line 205, in process_word
process_word(parent)
File "stanza/utils/conll18_ud_eval.py", line 205, in process_word
process_word(parent)
File "stanza/utils/conll18_ud_eval.py", line 205, in process_word
process_word(parent)
File "stanza/utils/conll18_ud_eval.py", line 197, in process_word
raise UDError("There is a cycle in a sentence")
__main__.UDError: There is a cycle in a sentence
ta_ttb --batch_size 100 --dropout 0.33
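
For anyone hitting the same "There is a cycle in a sentence" error: the message means that following the HEAD column from some word in the gold file never reaches the root. Below is a minimal sketch for locating such sentences before running the evaluation. It is not the code from conll18_ud_eval.py; the function names `split_sentences` and `find_head_cycles` are made up for illustration, and it assumes tab-separated CoNLL-U with HEAD in the seventh column.

```python
def split_sentences(lines):
    """Group lines into sentences separated by blank lines."""
    sentence = []
    for line in lines:
        line = line.rstrip("\n")
        if line.strip():
            sentence.append(line)
        elif sentence:
            yield sentence
            sentence = []
    if sentence:
        yield sentence


def find_head_cycles(lines):
    """Return (sentence_index, word_id) pairs for words that lie on a HEAD cycle.

    Hypothetical helper: assumes tab-separated CoNLL-U lines.
    """
    problems = []
    for sent_idx, sentence in enumerate(split_sentences(lines)):
        heads = {}
        for line in sentence:
            if line.startswith("#"):
                continue
            fields = line.split("\t")
            # Skip multi-word token ranges (e.g. 1-2), empty nodes (e.g. 1.1),
            # and lines too short to carry a HEAD column.
            if len(fields) < 7 or not fields[0].isdigit():
                continue
            if fields[6].isdigit():
                heads[int(fields[0])] = int(fields[6])
        for start in heads:
            seen = set()
            node = start
            # Follow HEAD pointers until we reach the root (0),
            # leave the sentence, or revisit a node.
            while node in heads and node not in seen:
                seen.add(node)
                node = heads[node]
            # Only words that lead back to themselves are on a cycle.
            if node == start:
                problems.append((sent_idx, start))
    return problems
```

Usage would be along the lines of `find_head_cycles(open(path, encoding="utf-8"))`, where `path` points at the gold CoNLL-U file; an empty list means no cycle was found under these assumptions.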

Collaborator
AngledLuffa commented Dec 19, 2020 via email

Author
sarves commented Dec 21, 2020

Hi,
Thank you. I think it would be good to include this in the documentation; sorry if I missed it.
By the way, prep_tokenize requires specifying whether we are preprocessing the test, dev, or train set, but prep_mwt.sh worked without that argument.
Please take a look.

Now I get the following error after tokenisation, even though I no longer see any problem with the dataset:

Step 20000/ 20000 Loss: 0.002
ta_ttb: token F1 = 99.80, sentence F1 = 95.22, mwt F1 = 95.77
Dev score: 97.449
Best dev score=0.9881956158953555 at step 400
Running tokenizer in predict mode
1 sentences loaded.
OOV rate: 0.000% ( 0/ 6276)
Traceback (most recent call last):
File "stanza/utils/conll18_ud_eval.py", line 532, in <module>
main()
File "stanza/utils/conll18_ud_eval.py", line 500, in main
evaluation = evaluate_wrapper(args)
File "stanza/utils/conll18_ud_eval.py", line 482, in evaluate_wrapper
gold_ud = load_conllu_file(args.gold_file)
File "stanza/utils/conll18_ud_eval.py", line 478, in load_conllu_file
return load_conllu(_file)
File "stanza/utils/conll18_ud_eval.py", line 239, in load_conllu
raise UDError("There is an empty FORM in the CoNLL-U file")
__main__.UDError: There is an empty FORM in the CoNLL-U file
ta_ttb --batch_size 32 --dropout 0.33

I also get the following when I run run_mwt.sh without any additional parameters:

bash scripts/run_mwt.sh UD_Tamil-TTB
Running ...
Running MWT expander in train mode
max_dec_len: 34
Loading data with batch size 50...
Traceback (most recent call last):
File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/sarves/Stanza/stanza-train-master/stanza/stanza/models/mwt_expander.py", line 255, in <module>
main()
File "/home/sarves/Stanza/stanza-train-master/stanza/stanza/models/mwt_expander.py", line 89, in main
train(args)
File "/home/sarves/Stanza/stanza-train-master/stanza/stanza/models/mwt_expander.py", line 97, in train
train_doc = Document(CoNLL.conll2dict(input_file=args['train_file']))
File "/home/sarves/Stanza/stanza-train-master/stanza/stanza/utils/conll.py", line 95, in conll2dict
doc_conll = CoNLL.load_conll(infile, ignore_gapping)
File "/home/sarves/Stanza/stanza-train-master/stanza/stanza/utils/conll.py", line 44, in load_conll
assert len(array) == FIELD_NUM,
AssertionError: Cannot parse CoNLL line: expecting 10 fields, 2 found.
Running MWT expander in predict mode
Building an attentional Seq2Seq model...
Using a Bi-LSTM encoder
Using soft attention for LSTM.
Finetune all embeddings.
max_dec_len: 34
Loading data with batch size 50...
3 batches created.
Running the seq2seq model...
/pytorch/aten/src/ATen/native/BinaryOps.cpp:81: UserWarning: Integer division of tensors using div or / is deprecated, and in a future release div will perform true division as in Python 3. Use true_divide or floor_divide (// in Python) instead.
Traceback (most recent call last):
File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/sarves/Stanza/stanza-train-master/stanza/stanza/models/mwt_expander.py", line 255, in <module>
main()
File "/home/sarves/Stanza/stanza-train-master/stanza/stanza/models/mwt_expander.py", line 91, in main
evaluate(args)
File "/home/sarves/Stanza/stanza-train-master/stanza/stanza/models/mwt_expander.py", line 248, in evaluate
_, _, score = scorer.score(system_pred_file, gold_file)
File "/home/sarves/Stanza/stanza-train-master/stanza/stanza/models/mwt/scorer.py", line 8, in score
evaluation = ud_scores(gold_conllu_file, system_conllu_file)
File "/home/sarves/Stanza/stanza-train-master/stanza/stanza/models/common/utils.py", line 52, in ud_scores
gold_ud = ud_eval.load_conllu_file(gold_conllu_file)
File "/home/sarves/Stanza/stanza-train-master/stanza/stanza/utils/conll18_ud_eval.py", line 478, in load_conllu_file
return load_conllu(_file)
File "/home/sarves/Stanza/stanza-train-master/stanza/stanza/utils/conll18_ud_eval.py", line 239, in load_conllu
raise UDError("There is an empty FORM in the CoNLL-U file")
stanza.utils.conll18_ud_eval.UDError: There is an empty FORM in the CoNLL-U file
Traceback (most recent call last):
File "stanza/utils/conll18_ud_eval.py", line 532, in <module>
main()
File "stanza/utils/conll18_ud_eval.py", line 500, in main
evaluation = evaluate_wrapper(args)
File "stanza/utils/conll18_ud_eval.py", line 482, in evaluate_wrapper
gold_ud = load_conllu_file(args.gold_file)
File "stanza/utils/conll18_ud_eval.py", line 478, in load_conllu_file
return load_conllu(_file)
File "stanza/utils/conll18_ud_eval.py", line 239, in load_conllu
raise UDError("There is an empty FORM in the CoNLL-U file")
__main__.UDError: There is an empty FORM in the CoNLL-U file
ta_ttb

By the way, my intention is to train a multi-word token expander, so I prepared the data mostly with MWT tokens; a sample is given below. I hope that is not an issue.

1-2 மனைவியும் _ _ _ _ _ _ _ _
1 மனைவி _ _ _ _ _ _ _ _
2 உம் _ _ _ _ _ _ _ _

1-2 மனதையும் _ _ _ _ _ _ _ _
1 மனதை _ _ _ _ _ _ _ _
2 உம் _ _ _ _ _ _ _ _
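
The two errors in the tracebacks above ("expecting 10 fields, 2 found" and "There is an empty FORM") both point at individual malformed lines, so a quick line-by-line scan of the data file may help to find them. This is only a rough sketch, not the check that Stanza or conll18_ud_eval.py performs: the function name `check_conllu_lines` is invented for illustration, it assumes the columns are separated by tabs, and it treats a FORM as empty when it contains only whitespace.

```python
def check_conllu_lines(lines, field_num=10):
    """Return (line_number, message) pairs for lines that look malformed.

    Hypothetical helper: assumes tab-separated CoNLL-U with FORM in column 2.
    """
    problems = []
    for line_no, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        # Blank lines separate sentences; lines starting with '#' are comments.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != field_num:
            problems.append(
                (line_no, f"expected {field_num} fields, found {len(fields)}")
            )
        elif not fields[1].strip():
            problems.append((line_no, "empty FORM"))
    return problems
```

Running it as `check_conllu_lines(open(path, encoding="utf-8"))`, with `path` set to the train or dev file, gives the line numbers to inspect. A line whose columns are separated by spaces rather than tabs would show up here with a wrong field count.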

Thank you in advance, Sarves

Collaborator
AngledLuffa commented Dec 22, 2020 via email

Author
sarves commented Dec 30, 2020

Hi

Thanks, it worked.
Also, for some reason I had to delete everything in data/mwt.
Now it works.

Thanks

sarves closed this as completed Dec 30, 2020