8000 I want to add a metric type for `sum distinct` · Issue #4329 · lightdash/lightdash · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

I want to add a metric type for sum distinct #4329

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
5 tasks
TuringLovesDeathMetal opened this issue Jan 30, 2023 · 13 comments
Open
5 tasks

I want to add a metric type for sum distinct #4329

TuringLovesDeathMetal opened this issue Jan 30, 2023 · 13 comments
Labels
⚙️ backend This issue requires changes to the backend ✨ feature-request Request for a new feature or functionality

Comments

@TuringLovesDeathMetal
Copy link
Contributor
TuringLovesDeathMetal commented Jan 30, 2023

Description

I want to be able to create a sum of values that are based on a distinct list of items.

For example, I have a table like this:
image

I want to create a sum of all item costs, based on the distinct item number. So, my sum_distinct for this table would be 15.

I would have to define the field I wanted to be distinct on for the metric, then define the dimension I'm summing. So the .yml could look something like this:

name: cost
  meta:
    metrics:
      sum_distinct_cost:
        type: sum_distinct
        distinct_keys: 
           - ${item_number}
           - ${item_name}

The SQL for this metric would look something like this (we'd probably need to use hashing functions):

SELECT 
SUM(DISTINCT(cost + HASH(CONCAT(${item_number), ${item_name})))) - SUM(DISTINCT(HASH(CONCAT(${item_number), ${item_name}))) AS sum_distinct_cost

FROM table

Acceptance criteria

  • I'm able to add sum distinct metrics in my .yml
  • I can define a sum distinct and specify which columns it needs to be distinct on
  • I can reference columns from joined tables in my distinct_key values
  • the distinct_key can take a list or a single value
  • docs have been updated to include the sum distinct metric: https://docs.lightdash.com/references/metrics#metric-types
@TuringLovesDeathMetal TuringLovesDeathMetal added the ✨ feature-request Request for a new feature or functionality label Jan 30, 2023
@ZeRego ZeRego added the ⚙️ backend This issue requires chang 8000 es to the backend label Jan 31, 2023
@stale
Copy link
stale bot commented Apr 1, 2023

Is this issue still relevant? There have been no updates for 60 days, please close the issue or keep the conversation going!

@stale stale bot added the stale Issue hasn't been updated for 60 days label Apr 1, 2023
@stale stale bot removed the stale Issue hasn't been updated for 60 days label Apr 20, 2023
@stale
Copy link
stale bot commented Jun 19, 2023

Is this issue still relevant? There have been no updates for 60 days, please close the issue or keep the conversation going!

@stale stale bot added the stale Issue hasn't been updated for 60 days label Jun 19, 2023
@12ian34
Copy link
Contributor
12ian34 commented Jan 19, 2024

Just requested by a partner.

@stale stale bot removed the stale Issue hasn't been updated for 60 days label Jan 19, 2024
@DougsHub
Copy link

I'd love to request this feature as well!

@fev4
Copy link
fev4 commented Mar 13, 2024

this would be such a great addition!

@GaylorBrunner22
Copy link

Very relevant, I want it too !

@yu-iskw
Copy link
Contributor
yu-iskw commented Jul 18, 2024

@TuringLovesDeathMetal told me a workaround to calculate something like sum_distinct using the symmetric aggregates approach. However, I found it difficult to precisely calculate it.

https://cloud.google.com/looker/docs/best-practices/understanding-symmetric-aggregates

TL;DR

I figured out that the symmetric aggregation on Lightdash with BigQuery is difficult, because overflow happens when summing the values.

Background

  • As far as I know, the FARM_FINGERPRINT function is the only built-in function to convert a value to a hash of number.

What I tried

I tried the symmetric aggregation with the orders table of the jaffle_shop dataset.
Here, I implement the metrics to virtually calculate the sum of the amount column on unique order_id values.
If we just sum the values of the amount column on unique order_id values, the value is 1672.0.

Step 1: Basic Implementation

I started with the following basic implementation.
However, I encountered the integer overflow error: Error in SUM aggregation: integer overflow on the Lightdash UI.

meta:
  metrics:
    sum_distinct_order_amount:
      type: number
      sql: |
        ${sum_distinct_order_amount_helper1} - ${sum_distinct_order_amount_helper2}
    sum_distinct_order_amount_helper1:
      type: sum
      hidden: true
      sql: |
        DISTINCT(${amount} + FARM_FINGERPRINT(${order_id}))
    sum_distinct_order_amount_helper2:
      type: sum
      hidden: true
      sql: |
        DISTINCT(FARM_FINGERPRINT(${order_id}))

Step 2: Casting to NUMERIC

So, I tried casting the values to NUMERIC to avoid the overflow issue.
However, I got a resulting negative value as -12288.

According to my research, the value of the amount column is relatively too small against the values of the FARM_FINGERPRINT function.
So, the calculation result can be incorrect.

meta:
  metrics:
    sum_distinct_order_amount:
      type: number
      sql: |
        ${sum_distinct_order_amount_helper1} - ${sum_distinct_order_amount_helper2}
    sum_distinct_order_amount_helper1:
      type: sum
      hidden: true
      sql: |
        DISTINCT(${amount} + SAFE_CAST(FARM_FINGERPRINT(${order_id}) AS NUMERIC))
    sum_distinct_order_amount_helper2:
      type: sum
      hidden: true
      sql: |
        DISTINCT(SAFE_CAST(FARM_FINGERPRINT(${order_id}) AS NUMERIC))

Step 3: Using smaller hash values

I assumed that the value of the FARM_FINGERPRINT function is too large to the values of the amount column.
So, I divided the resulting values to make the value smaller.
Finally, I got 1671.9999389648438 similar to the actual value of the summation.

However, if the results by dividing the values are not sure to unique order_id values.
If the uniqueness is not guaranteed, the results are not sure to be correct.
Meanwhile, the value to divide the resulting values is depends on the value of the amount column.
We probably need to find the appropriate value to divide the resulting values.

meta:
  metrics:
    sum_distinct_order_amount:
      type: number
      sql: |
        ${sum_distinct_order_amount_helper1} - ${sum_distinct_order_amount_helper2}
    sum_distinct_order_amount_helper1:
      type: sum
      hidden: true
      sql: |
        DISTINCT(${amount} + SAFE_CAST(FARM_FINGERPRINT(${order_id}) AS NUMERIC) / 123456789.0)
    sum_distinct_order_amount_helper2:
      type: sum
      hidden: true
      sql: |
        DISTINCT(SAFE_CAST(FARM_FINGERPRINT(${order_id}) AS NUMERIC) / 123456789.0)

@lightdash-bot
Copy link
Collaborator

@lightdash-bot
Copy link
Collaborator

@lightdash-bot
Copy link
Collaborator

@yu-iskw
Copy link
Contributor
yu-iskw commented Nov 12, 2024

A workaround to virtually calculate sum distinct count in Lightdash

I figured out a workaround to virtually implement sum distinct count in Lightdash with BigQuery. It isn't a perfect solution, as there might be some limitation. For instance, filtering by element(s) in the unnested field doesn't work.
We assume if we have the subsequent data with ARRAY type column.
We have to expand the ARRAY field into separate rows so that we can deal with each element as a row in Lightdash, because Lightdash doesn't natively support ARRAY type columns.
So, It causes the issue to calculate summation metrics on duplicated rows.

BigQuery offers ROW_NUMBER() window function, which can be used as a unique ID within each partition.
We can select only the first row of each expanded rows to avoid the issue.

  1. We add a Lightdash dimension with ROW_NUMBER() window function to the ARRAY column.
  2. We set a filter on the Lightdash UI to select only the first row of each expanded rows.

There are two approaches to add a dimension with ROW_NUMBER() window function.
First, we can implement a dimension with ROW_NUMBER() in the target dbt model.
Second, we can add a dimension with ROW_NUMBER() as an additional dimension in the Lightdash UI.

image

-- The original data doesn't have duplicated rows
WITH jobs AS (
  SELECT
    1 AS job_id,
    TIMESTAMP('2024-01-01T00:00:00Z') AS creation_time,
    ARRAY<STRING>['a', 'b'] AS tags,
  UNION ALL
  SELECT
    2 AS job_id,
    TIMESTAMP('2024-01-02T00:00:00Z') AS creation_time,
    ARRAY<STRING>['a', 'c'] AS tags,
  UNION ALL
  SELECT
    3 AS job_id,
    TIMESTAMP('2024-01-03T00:00:00Z') AS creation_time,
    ARRAY<STRING>[] AS tags,
)

-- We have to expand the ARRAY field into separate rows by the tag.
-- That means we have to create duplicate rows for the same job_id.
SELECT
  ROW_NUMBER() OVER (PARTITION BY jobs.job_id ORDER BY jobs.creation_time) AS job_id_row_number,
  jobs.job_id,
  jobs.creation_time,
  tag,
FROM jobs AS jobs
LEFT JOIN UNNEST(jobs.tags) AS tag
    columns:
      - name: job_id_row_number
        meta:
          dimension:
            type: number
            label: job_id_row_number

@lightdash-bot
Copy link
Collaborator

@lightdash-bot
Copy link
C 6E19 ollaborator

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
⚙️ backend This issue requires changes to the backend ✨ feature-request Request for a new feature or functionality
Projects
None yet
Development

No branches or pull requests

8 participants
0