-
Notifications
You must be signed in to change notification settings - Fork 646
RFC: approx_count_distinct
state table encoding
#3414
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
An alternative design is to store all counts of a bucket in one KV-entry, the size would be 92 bytes and seems okay to me. |
I am also thinking about whether my design is good, because the memory usage it quite high (~1.5MB). This would be a problem for both 2-phase agg and HashAgg. Perhaps we should only provides an append-only |
Hmm, indeed. One way is to use the The partial agg can compute partial results of the harmonic mean. As for hash agg, we may not need to support it. Or, we could throw a runtime error if the number of group by keys grows too large? |
That's a very interesting insight! But it seems hard to be used along with other agg functions. |
Yes, this is a feasible solution. I am just not sure whether this is what our users want. Anyway, the |
We may hold this RFC for a while, I think |
Yes, indeed. It is not lightweight when the distinct elements are small in number and group keys are large in number. I think we may need the sparse-dense transition to handle this. |
I have added back this issue to the project board, to remind that we have to need to make a decision on whether we decide to support or disable |
The difficulty is: the state, as described in the RFC, is very special compared with these aggregate functions that we have supported (e.g. min/max or sum/count) |
Note that one has to persist a different state per vnode to enable consistent scaling. |
What's the status of this RFC? |
Current status of
As said in previous discussion, we may consider not to allow this function in hash agg, because the state can explode ridiculously. When there's no many rows within each group, with a correctly implemented non-append-only state, there're fixed 65536 rows for each group, and the value column is of type If we do decide to only support this agg for simple agg, we can fearlessly have a state table storing each bucket as one row, then we can have a correct non-append-only implementation. |
Uh oh!
There was an error while loading. Please reload this page.
RFC Template
Background
We need to consider
approx_count_distinct
state table encoding to facilitate:Design
Encoding of
approx_count_distinct
key:
table_id | bucket_id (16 bits) | count_index (8 bits)
(count_index ranges from 0-48 and represents trailing zeroes)value:
count (u64)
bucket_id (16 bits) | count_index (8 bits)
forms the PK for the item.Seems like vnode does not factor into the encoding since we are assuming that
approx_count_distinct
is a simple agg singleton operator.Future Optimizations
We could consider
approx_count_distinct
withgroup_by
. Because of the large default in-memory state size ofapprox_count_distinct
, we may only be able to support a small number ofgroup_by
keys.Discussions
Q&A
The text was updated successfully, but these errors were encountered: