feat: add sortable keys for record linkage #654
Draft
The idea is to generate a list of sortable keys (the buckets the fields hash into) so that we can find similar records. You can run a multi-compare against these keys and grab the rows that sort just above or below a target key, which shrinks the number of detailed similarity-scoring calls we have to make.
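As a rough illustration of the idea above, here is a minimal sketch that builds one sortable key per record and pulls the keys that sort next to a target. The field names, the `|` separator, and the helpers `sortable_key` and `neighbors` are hypothetical and not taken from this PR. The sketch also uses plain normalized strings where the PR describes hash buckets.

```python
import bisect

# Hypothetical field order, broadest first; the PR does not specify one.
FIELDS = ("country", "state", "city", "line1", "line2")

def sortable_key(record, fields=FIELDS):
    """Join normalized fields, broadest first, into one sortable string."""
    parts = [" ".join(str(record.get(f, "")).lower().split()) for f in fields]
    return "|".join(parts)

def neighbors(sorted_keys, target, window=2):
    """Return up to `window` keys on each side of where `target` sorts.

    Keys equal to `target` are skipped, so only candidates come back.
    """
    lo = bisect.bisect_left(sorted_keys, target)
    hi = bisect.bisect_right(sorted_keys, target)
    return sorted_keys[max(0, lo - window):lo] + sorted_keys[hi:hi + window]

records = [
    {"country": "US", "state": "NY", "city": "New York", "line1": "1 Main St"},
    {"country": "US", "state": "NY", "city": "New York", "line1": "2 Main St"},
    {"country": "US", "state": "CA", "city": "Los Angeles", "line1": "5 Hill Rd"},
    {"country": "US", "state": "NY", "city": "Albany", "line1": "9  Oak Ave"},
]
keys = sorted(sortable_key(r) for r in records)
candidates = neighbors(keys, sortable_key(records[0]), window=1)
```

Only the rows in `candidates` would then go through the detailed similarity scoring, rather than every row in the table.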
You could then compute traditional string distance metrics over these sortable keys to rank the most similar candidates. Within each key, the fields run from general to more specific.
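A minimal sketch of that ranking step is below. It uses the `SequenceMatcher.ratio()` similarity from Python's standard `difflib` as a stand-in for a traditional distance metric such as Levenshtein. The helper name `rank_candidates` and the sample keys are made up for illustration.

```python
from difflib import SequenceMatcher

def rank_candidates(target_key, candidate_keys):
    """Score each candidate key against the target, most similar first.

    ratio() returns a similarity in [0, 1]; 1.0 means identical strings.
    """
    scored = [
        (SequenceMatcher(None, target_key, key).ratio(), key)
        for key in candidate_keys
    ]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

ranked = rank_candidates(
    "us|ny|new york|1 main st",
    ["us|ca|los angeles|5 hill rd", "us|ny|new york|1 main street"],
)
best_score, best_key = ranked[0]
```

Because the broad fields lead the key, two records in the same city already share a long common prefix, so the score mostly reflects differences in the more specific fields.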
Because the broad fields sit on the left, the keys support prefix filtering in SQL. You could strip the Line1/Line2 data and filter down to city level, or find the rows near an exact address by grabbing those that sort just above and below the target.
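Both queries can be sketched with Python's built-in `sqlite3` module. The table `addresses`, the column `sort_key`, the `|` separator, and the helpers `city_level` and `nearby` are assumptions for illustration, not part of this PR. The sample keys contain no `%` or `_`, which `LIKE` would otherwise treat as wildcards.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE addresses (sort_key TEXT PRIMARY KEY)")
conn.executemany(
    "INSERT INTO addresses VALUES (?)",
    [
        ("us|ca|los angeles|5 hill rd|",),
        ("us|ny|albany|9 oak ave|",),
        ("us|ny|new york|1 main st|",),
        ("us|ny|new york|2 main st|apt 4",),
    ],
)

def city_level(conn, full_key, depth=3):
    """Drop the line1/line2 parts and match every key sharing the city prefix."""
    prefix = "|".join(full_key.split("|")[:depth]) + "|"
    rows = conn.execute(
        "SELECT sort_key FROM addresses WHERE sort_key LIKE ? ORDER BY sort_key",
        (prefix + "%",),
    )
    return [row[0] for row in rows]

def nearby(conn, full_key, n=1):
    """Return up to n keys sorting just below and n just above the target."""
    below = conn.execute(
        "SELECT sort_key FROM addresses WHERE sort_key < ? "
        "ORDER BY sort_key DESC LIMIT ?",
        (full_key, n),
    ).fetchall()
    above = conn.execute(
        "SELECT sort_key FROM addresses WHERE sort_key > ? "
        "ORDER BY sort_key ASC LIMIT ?",
        (full_key, n),
    ).fetchall()
    return [row[0] for row in reversed(below)] + [row[0] for row in above]

target = "us|ny|new york|1 main st|"
same_city = city_level(conn, target)
around = nearby(conn, target, n=1)
```

The two range queries in `nearby` can each be served by an index on `sort_key`, so the lookup should not need to scan the whole table.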