8000 New WikiTables language by matt-gardner · Pull Request #2350 · allenai/allennlp · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content
This repository was archived by the owner on Dec 16, 2022. It is now read-only.

New WikiTables language #2350

Merged
merged 7 commits into from
Jan 17, 2019
Merged

Conversation

matt-gardner
Copy link
Contributor

This PR re-writes the variable-free language using the new API that does automatic grammar induction. I improved some things while porting the NLVR language, and this PR depends on those improvements, so it's not ready for review until #2319 is merged. But I'm opening this now so I can test it on CI.

This PR is a net reduction of ~600 lines of code, and the code that's left is way simpler. I'm pretty happy with how this rewrite of the grammar stuff worked out.

A few points to note on this new language:

  • I considered fixing the row / list of rows distinction (as mentioned here), but I thought for this initial PR I'd stick as closely as possible to the original language. We can change it later if we want to. Though, for this particular decision, I'm not sure if distinguishing the two actually helps you - you remove some spurious programs, but it comes at the expense of a deeper type system, because you have to add function calls to get from List[Row] to Row, and the type of select gets muddy.
  • Related a bit to the above, I removed the "multi-match" stuff entirely. Instead, I made the multi-matching happen at the terminal productions. This was a lot easier to do than implementing multi-match stuff in the new grammar induction API, and, more importantly, I think the new way makes more sense from a modeling perspective than the previous way. For instance, select has type <List[Row],Column:List[str]>. With the multi-match implementation, you would get three productions for List[str]: ['List[str] -> [<List[Row],Column:List[str]>, List[Row], DateColumn]', 'List[str] -> [<List[Row],Column:List[str]>, List[Row], NumberColumn]', 'List[str] -> [<List[Row],Column:List[str]>, List[Row], StringColumn]']. This means that the model has to decide when it's expanding List[str] what kind of column it is going to use. Except the column it uses is really a linking decision, and we have a lot more capacity to make that decision when we're predicting terminals, using our linking mechanism. So we're asking the List[str] embedding to do a whole lot, more than it should be doing. The way I implemented this, List[str] has just one production from select: ['List[str] -> [<List[Row],Column:List[str]>, List[Row], Column], and now there's a Column pre-terminal that expands to columns of all types. That is, the "multi-matching" happens at the terminals (where they are parameterized with a linking mechanism), instead of high up in the abstract syntax tree. This paragraph describes the modeling reasons for doing it this way, and what it means for how the grammar is different. To see how it's implemented, look at the changes I made to domain_language.py.
  • While I did not distinguish between List[Row] and Row, I did make a minor change to the language, distinguishing between List[str] (the result of doing a select or a mode) and str (the words you can use from the question). filter_in and filter_not_in using str instead of List[str].
    This seemed like an obvious improvement, as question words are always strings, not lists. Though, thinking about it more, maybe I messed up - this means you can't filter based on the output of a select. @pdasigi, do we currently support that? Do we need it? It wasn't in any of the parsing tests, or it would have failed. This is easy to change back, if we need to (we can just mess with the types a bit to coerce the strings to be treated as List[str]).

@matt-gardner matt-gardner changed the title [Depends on #2319] New WikiTables language New WikiTables language Jan 15, 2019
@matt-gardner
Copy link
Contributor Author

Ok, @pdasigi, this is rebased and should be ready for review.

@matt-gardner matt-gardner requested a review from pdasigi January 15, 2019 05:18
Copy link
Member
@pdasigi pdasigi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great!

@pdasigi
Copy link
Member
pdasigi commented Jan 16, 2019

@matt-gardner Oh, and regarding your question about str vs List[str], I think it is better to have everything be a List[str]. More generally, the original implementation of this language involved evaluating the "expression" passed as the filter value in all filtering functions, and thus the value used for filtering could itself be a logical form with select, mode, etc. That said, I think it is okay to merge this PR now and improve the language to include those features later. The language in this repository is not the latest version of our implementation anyway (the one in allennlp-weak-supervision-research has some improvements on top of this). Actually, you may want to add a TODO somewhere saying that some features in the language are missing.

@matt-gardner matt-gardner merged commit ae63a2a into allenai:master Jan 17, 2019
@matt-gardner matt-gardner deleted the wtq_language branch January 17, 2019 01:14
@berl
8000
ino
Copy link
berlino commented Jan 24, 2019

@matt-gardner Regarding differentiating between "rows" and "a list of rows", grammar indeed becomes messy after my attempts. But I think a more realistic way would be enforce some constraints during the execution phase. For instance, it would fail if the function "previous/next" gets a list row with more than one element; on the contrary, functions like "average/sum/min" should take a list of rows with more than one element (though it's possible to have only one). Consider the cases like "min (argmax ... ), average(argmin(...))", they are very likely to be spurious programs. Does it make sense to you?

@matt-gardner
Copy link
Contributor Author

Yes, failing during execution is easy - we just need to raise an ExecutionError at the appropriate time, and be sure we catch it. I suppose you could use this to filter out a few spurious programs during decoding. I have no intuition for how often this actually happens, though.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants
0