-
-
Notifications
You must be signed in to change notification settings - Fork 163
feat(override): allow addendum data to override imported place data #1494
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
This is a workaround for the issues we've seen in pelias with scoring on records that have multiple names. The issue is that elasticsearch calculates the term frequency over all the names, so a place like ["Sutter St Cafe", "The Sutter St Cafe"] is more likely to match "Sutter St" than a record with just the name ["Sutter St"] This change allows us to create multiple records, one per name, but to specify in the addendum data what the canonical name is, so that we can generate lots of aliases for records and not worry too much about the term frequency scoring issues, such as adding "Restaurant" to every restaurant venue in our data, or aliasing "subway" and "sub way". I am open to other approaches to this.
if (place.addendum && place.addendum.override) { | ||
try { | ||
const overrideData = codec.decode(place.addendum.override); | ||
place = _.merge(place, overrideData); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This command is very powerful because it bypasses all the setter logic baked into pelias/model
which ensures validity and type correctness.
It would be fairly easy to do something silly like put a string where an array was expected for vice versa which would cause a runtime error and return a HTTP 500
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's true. I'm not sure how I could easily reuse the pelias-model logic it seems like it doesn't have a concept of loading/validating an object, so I don't know how I could use it at import or serving time to guarantee correctness.
Another approach would be to add a "displayName" property to pelias-model which seems almost more fragile since there would be so many places to decide if I should use name or displayName.
Do you have any better ideas on how this could be implemented?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@orangejulius and I just discussed this in our call today.
It's a very powerful tool, so we're happy to merge this as an undocumented feature with the disclaimer that it can break things and errors made in the addendum will require a re-index to fix.
With that in mind I prefer the flexibility of addendum.override
than displayName
as it potentially allows prototyping other features in the future without changing code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me 👍
It's super useful when required but also a no-op when addendum.override
isn't specified.
…1494) * proof of concept for having addendums that override indexed fields * feat(override): allow addendum data to override imported place data * use _.has
This is part of a workaround for the issues we've seen in pelias with scoring on records that have multiple names. The issue is that elasticsearch calculates the term frequency over all the names, so a place like ["Sutter St Cafe", "The Sutter St Cafe"] is more likely to match "Sutter St" than a record with just the name ["Sutter St"]
This change allows us to create multiple records, one per name, but to specify in the addendum data what the canonical name (or any other data) is, so that we can generate lots of aliases for records and not worry too much about the term frequency scoring issues, such as adding "Restaurant" to every restaurant venue in our data, or aliasing "subway" and "sub way".
It's hard to say what the impact of this is - in our pelias set up we are generating aliases outside of pelias trunk, on commercial data. Without this change, creating aliases (like turning all "AA" tokens into "A A", generating compound/decompound mappings for words like coffeeshop<->coffee shop) we were seeing some regressions when only creating one record with multiple names, due to repeated tokens. With this change in approach, we are not seeing any regressions.
I would like to get this change into pelias master (rather than applying this hack at our application server level) so that I can continue performing side-by-side evaluations at the pelias api level on new indexes and accurately seeing what we are trying to show to users.
I am open to other approaches to this.