8000 Proposal: DelegatedIndex by danielcrenna · Pull Request #26 · bleroy/lunr-core · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

Proposal: DelegatedIndex #26

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 50 commits into from
Jan 1, 2021
Merged

Proposal: DelegatedIndex #26

merged 50 commits into from
Jan 1, 2021

Conversation

danielcrenna
Copy link
Contributor

Finally, here is a starter proof of concept for a delegated index.

This is in support of #13 and #19. So that indexes can be saved to disk, added to incrementally, et. al.

The reference implementation shows an LMDB-backed (memory mapped files on disk) DelegatedIndex that defers the five core functions of the in-memory index, so that any read-only index implementation could be plugged in.

Then, the mutable indexing features are added via regular methods.

The performance test indicates the delegated index variant has negligible performance impacts (for some reason it's faster when I run it locally, but it's within the tolerance for the original tests).

It's done as additive-only to not touch any of the original code.

Before I go too far, I wanted to get your feedback on this approach.

@bleroy
Copy link
Owner
bleroy commented Dec 29, 2020

Oooh, cool, thanks for the contrib. Will review ASAP.

@danielcrenna
Copy link
Contributor Author

Oooh, cool, thanks for the contrib. Will review ASAP.

There's some hiccup in the tests I'll investigate if this is deemed a viable pathway -- looks like a linux LMDB issue.

@danielcrenna
Copy link
Contributor Author

It looks like they only provide a runtime for linux-arm, and that they want you to install lmdb for your flavor of Linux via the OS package manager: CoreyKaylor/Lightning.NET#71

Copy link
Owner
@bleroy bleroy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is pretty cool, thanks for the awesome contribution. My biggest feedback is on the delegate vs. interface thing.

@@ -0,0 +1,6 @@
<wpf:ResourceDictionary xml:space="preserve" xmlns:x="http://schemas.microsoft.com/winfx/2006/xaml" xmlns:s="clr-namespace:System;assembly=mscorlib" xmlns:ss="urn:shemas-jetbrains-com:settings-storage-xaml" xmlns:wpf="http://schemas.microsoft.com/winfx/2006/xaml/presentation">
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please also exclude the Resharper dotsettings files.

foreach (string expandedTerm in expandedTerms)
{
// For each term get the posting and termIndex, this is required for building the query vector.
InvertedIndexEntry posting = GetInvertedIndexEntryByKey(expandedTerm);
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You've got a bunch of possible null value warnings here. I know there are some in the original code as well, that I'll take care of, but it would be great to fix those as well before I merge

{
unsafe
{
fixed(byte* buf = &buffer.GetPinnableReference())
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious how much better this is than using Span<T>?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to answer this via: 4482cfd

Note that the benchmark implementation is kind of wonky because I couldn't find an API that allowed me to construct a Stream from a Span<byte>, so I had to call the primitive methods directly.

It does look like Span is the clear winner, but I'm not confident in my implementation since it requires manually slicing the buffer.

BenchmarkDotNet=v0.12.1, OS=Windows 10.0.18363.1256 (1909/November2018Update/19H2)
Intel Core i9-9820X CPU 3.30GHz, 1 CPU, 20 logical and 10 physical cores
.NET Core SDK=5.0.101
  [Host]     : .NET Core 3.1.10 (CoreCLR 4.700.20.51601, CoreFX 4.700.20.51901), X64 RyuJIT
  DefaultJob : .NET Core 3.1.10 (CoreCLR 4.700.20.51601, CoreFX 4.700.20.51901), X64 RyuJIT

Method Mean Error StdDev
GetPinnableReference 583.48 ns 2.811 ns 2.629 ns
Span 81.90 ns 0.356 ns 0.297 ns

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's pretty spectacular. Your implementation looks fine to me, but piece of mind should just be a matter of writing enough tests ;)

@danielcrenna
Copy link
Contributor Author

@bleroy I believe everything is in place and ready for your review.

  • Added custom serialization support for primitives, string, and Lunr.Slice.
  • Added an LmdbBuilder that mimics the original Builder
  • Rewrote deserialization to use Span<T>
  • Added test coverage and fixed lots of failures

Here are the current search performance comparisons between Index and LmdbIndex/DelegatedIndex

Index

BenchmarkDotNet=v0.12.1, OS=Windows 10.0.18363.1256 (1909/November2018Update/19H2)
Intel Core i9-9820X CPU 3.30GHz, 1 CPU, 20 logical and 10 physical cores
.NET Core SDK=5.0.101
  [Host]     : .NET Core 3.1.10 (CoreCLR 4.700.20.51601, CoreFX 4.700.20.51901), X64 RyuJIT
  DefaultJob : .NET Core 3.1.10 (CoreCLR 4.700.20.51601, CoreFX 4.700.20.51901), X64 RyuJIT

Method Mean Error StdDev
SearchSingleTerm 23.89 μs 0.051 μs 0.048 μs
SearchMultipleTerms 41.98 μs 0.121 μs 0.108 μs
SearchTrailingWildcard 13.27 μs 0.051 μs 0.045 μs
SearchLeadingWildcard 53.93 μs 0.081 μs 0.072 μs
SearchContainedWildcard 16.32 μs 0.027 μs 0.024 μs
SearchWithField 21.30 μs 0.053 μs 0.049 μs
SearchWithEditDistance 105.73 μs 0.227 μs 0.190 μs
SearchTypeAhead 30.34 μs 0.085 μs 0.080 μs
SearchNegatedQuery 22.65 μs 0.059 μs 0.049 μs
SearchRequiredTerm 44.11 μs 0.103 μs 0.096 μs

LmdbIndex

BenchmarkDotNet=v0.12.1, OS=Windows 10.0.18363.1256 (1909/November2018Update/19H2)
Intel Core i9-9820X CPU 3.30GHz, 1 CPU, 20 logical and 10 physical cores
.NET Core SDK=5.0.101
  [Host]     : .NET Core 3.1.10 (CoreCLR 4.700.20.51601, CoreFX 4.700.20.51901), X64 RyuJIT
  DefaultJob : .NET Core 3.1.10 (CoreCLR 4.700.20.51601, CoreFX 4.700.20.51901), X64 RyuJIT

Method Mean Error StdDev
SearchSingleTerm 81.22 μs 0.370 μs 0.346 μs
SearchMultipleTerms 148.10 μs 0.465 μs 0.388 μs
SearchTrailingWildcard 69.95 μs 0.545 μs 0.483 μs
SearchLeadingWildcard 107.46 μs 1.168 μs 1.092 μs
SearchContainedWildcard 70.52 μs 0.288 μs 0.269 μs
SearchWithField 74.15 μs 0.834 μs 0.780 μs
SearchWithEditDistance 158.38 μs 0.573 μs 0.536 μs
SearchTypeAhead 172.28 μs 0.728 μs 0.646 μs
SearchNegatedQuery 80.54 μs 0.303 μs 0.284 μs
SearchRequiredTerm 148.06 μs 1.081 μs 1.011 μs

@danielcrenna danielcrenna requested a review from bleroy December 31, 2020 01:23
@danielcrenna
Copy link
Contributor Author

@bleroy review fixes are in.

You can assign any issues that come up for LMDB to me.

I will also issue a follow-up PR with markdown docs for LMDB usage once merged, and once I use it in anger for my use case.

I'm still thinking through how to unlock the feature of adding new documents to the index after an index has been built, without having the user need to do it themselves via AddField, AddFieldVector, et. al., since that will require more knowledge than anyone would want to have.

Copy link
Owner
@bleroy bleroy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another thing we need to decide: packaging. I think we should put LMDB support in a separate package that has a dependency on the core package, if only because its other dependencies are a little more complex and I don't want to impose additional constraints to use in-memory features. It's your call whether you want to publish the new package yourself on NuGet or if you want me to continue managing the publication of both. I'm fine with either.

@bleroy bleroy merged commit 0c79006 into bleroy:main Jan 1, 2021
@bleroy
Copy link
Owner
bleroy commented Jan 1, 2021

Thanks for the hard work and great features you put into this PR!

@danielcrenna
Copy link
Contributor Author

Thanks for the hard work and great features you put into this PR!

Thanks for the existence of this project and for being responsive over a holiday! I'm looking forward to using it to provide full text search for reference data without the need to manage an ES cluster, will come in handy in all kinds of scenarios.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants
0