I want to add a metric type for `sum distinct` #4329

TuringLovesDeathMetal · 2023-01-30T16:31:09Z

Description

I want to be able to create a sum of values that are based on a distinct list of items.

For example, I have a table like this:

I want to create a sum of all item costs, based on the distinct item number. So, my sum_distinct for this table would be 15.

I would have to define the field I wanted to be distinct on for the metric, then define the dimension I'm summing. So the .yml could look something like this:

name: cost
  meta:
    metrics:
      sum_distinct_cost:
        type: sum_distinct
        distinct_keys: 
           - ${item_number}
           - ${item_name}

The SQL for this metric would look something like this (we'd probably need to use hashing functions):

SELECT 
SUM(DISTINCT(cost + HASH(CONCAT(${item_number), ${item_name})))) - SUM(DISTINCT(HASH(CONCAT(${item_number), ${item_name}))) AS sum_distinct_cost

FROM table

Acceptance criteria

I'm able to add sum distinct metrics in my .yml
I can define a sum distinct and specify which columns it needs to be distinct on
I can reference columns from joined tables in my distinct_key values
the distinct_key can take a list or a single value
docs have been updated to include the sum distinct metric: https://docs.lightdash.com/references/metrics#metric-types

The text was updated successfully, but these errors were encountered:

stale · 2023-04-01T12:31:08Z

Is this issue still relevant? There have been no updates for 60 days, please close the issue or keep the conversation going!

stale · 2023-06-19T08:36:44Z

Is this issue still relevant? There have been no updates for 60 days, please close the issue or keep the conversation going!

12ian34 · 2024-01-19T14:20:56Z

Just requested by a partner.

DougsHub · 2024-01-21T19:23:01Z

I'd love to request this feature as well!

fev4 · 2024-03-13T01:47:28Z

this would be such a great addition!

GaylorBrunner22 · 2024-05-06T07:35:28Z

Very relevant, I want it too !

yu-iskw · 2024-07-18T05:08:32Z

@TuringLovesDeathMetal told me a workaround to calculate something like sum_distinct using the symmetric aggregates approach. However, I found it difficult to precisely calculate it.

https://cloud.google.com/looker/docs/best-practices/understanding-symmetric-aggregates

TL;DR

I figured out that the symmetric aggregation on Lightdash with BigQuery is difficult, because overflow happens when summing the values.

Background

As far as I know, the FARM_FINGERPRINT function is the only built-in function to convert a value to a hash of number.

What I tried

I tried the symmetric aggregation with the orders table of the jaffle_shop dataset.
Here, I implement the metrics to virtually calculate the sum of the amount column on unique order_id values.
If we just sum the values of the amount column on unique order_id values, the value is 1672.0.

Step 1: Basic Implementation

I started with the following basic implementation.
However, I encountered the integer overflow error: Error in SUM aggregation: integer overflow on the Lightdash UI.

meta:
  metrics:
    sum_distinct_order_amount:
      type: number
      sql: |
        ${sum_distinct_order_amount_helper1} - ${sum_distinct_order_amount_helper2}
    sum_distinct_order_amount_helper1:
      type: sum
      hidden: true
      sql: |
        DISTINCT(${amount} + FARM_FINGERPRINT(${order_id}))
    sum_distinct_order_amount_helper2:
      type: sum
      hidden: true
      sql: |
        DISTINCT(FARM_FINGERPRINT(${order_id}))

Step 2: Casting to NUMERIC

So, I tried casting the values to NUMERIC to avoid the overflow issue.
However, I got a resulting negative value as -12288.

According to my research, the value of the amount column is relatively too small against the values of the FARM_FINGERPRINT function.
So, the calculation result can be incorrect.

meta:
  metrics:
    sum_distinct_order_amount:
      type: number
      sql: |
        ${sum_distinct_order_amount_helper1} - ${sum_distinct_order_amount_helper2}
    sum_distinct_order_amount_helper1:
      type: sum
      hidden: true
      sql: |
        DISTINCT(${amount} + SAFE_CAST(FARM_FINGERPRINT(${order_id}) AS NUMERIC))
    sum_distinct_order_amount_helper2:
      type: sum
      hidden: true
      sql: |
        DISTINCT(SAFE_CAST(FARM_FINGERPRINT(${order_id}) AS NUMERIC))

Step 3: Using smaller hash values

I assumed that the value of the FARM_FINGERPRINT function is too large to the values of the amount column.
So, I divided the resulting values to make the value smaller.
Finally, I got 1671.9999389648438 similar to the actual value of the summation.

However, if the results by dividing the values are not sure to unique order_id values.
If the uniqueness is not guaranteed, the results are not sure to be correct.
Meanwhile, the value to divide the resulting values is depends on the value of the amount column.
We probably need to find the appropriate value to divide the resulting values.

meta:
  metrics:
    sum_distinct_order_amount:
      type: number
      sql: |
        ${sum_distinct_order_amount_helper1} - ${sum_distinct_order_amount_helper2}
    sum_distinct_order_amount_helper1:
      type: sum
      hidden: true
      sql: |
        DISTINCT(${amount} + SAFE_CAST(FARM_FINGERPRINT(${order_id}) AS NUMERIC) / 123456789.0)
    sum_distinct_order_amount_helper2:
      type: sum
      hidden: true
      sql: |
        DISTINCT(SAFE_CAST(FARM_FINGERPRINT(${order_id}) AS NUMERIC) / 123456789.0)

lightdash-bot · 2024-09-06T07:54:11Z

This issue was mentioned by a user in slack: https://lightdash.slack.com/archives/C04KD95TF0U/p1725476387180909?thread_ts=1725476387.180909&cid=C04KD95TF0U

lightdash-bot · 2024-09-30T14:43:29Z

This issue was mentioned by a user in slack: https://lightdash-community.slack.com/archives/C03MG2VTR08/p1727705443187999?thread_ts=1727705443.187999&cid=C03MG2VTR08

lightdash-bot · 2024-10-04T12:47:20Z

This issue was mentioned by a user in slack: https://lightdash-community.slack.com/archives/C067S48BHRP/p1728016581620219?thread_ts=1728016581.620219&cid=C067S48BHRP

yu-iskw · 2024-11-12T01:25:29Z

A workaround to virtually calculate sum distinct count in Lightdash

I figured out a workaround to virtually implement sum distinct count in Lightdash with BigQuery. It isn't a perfect solution, as there might be some limitation. For instance, filtering by element(s) in the unnested field doesn't work.
We assume if we have the subsequent data with ARRAY type column.
We have to expand the ARRAY field into separate rows so that we can deal with each element as a row in Lightdash, because Lightdash doesn't natively support ARRAY type columns.
So, It causes the issue to calculate summation metrics on duplicated rows.

BigQuery offers ROW_NUMBER() window function, which can be used as a unique ID within each partition.
We can select only the first row of each expanded rows to avoid the issue.

We add a Lightdash dimension with ROW_NUMBER() window function to the ARRAY column.
We set a filter on the Lightdash UI to select only the first row of each expanded rows.

There are two approaches to add a dimension with ROW_NUMBER() window function.
First, we can implement a dimension with ROW_NUMBER() in the target dbt model.
Second, we can add a dimension with ROW_NUMBER() as an additional dimension in the Lightdash UI.

-- The original data doesn't have duplicated rows
WITH jobs AS (
  SELECT
    1 AS job_id,
    TIMESTAMP('2024-01-01T00:00:00Z') AS creation_time,
    ARRAY<STRING>['a', 'b'] AS tags,
  UNION ALL
  SELECT
    2 AS job_id,
    TIMESTAMP('2024-01-02T00:00:00Z') AS creation_time,
    ARRAY<STRING>['a', 'c'] AS tags,
  UNION ALL
  SELECT
    3 AS job_id,
    TIMESTAMP('2024-01-03T00:00:00Z') AS creation_time,
    ARRAY<STRING>[] AS tags,
)

-- We have to expand the ARRAY field into separate rows by the tag.
-- That means we have to create duplicate rows for the same job_id.
SELECT
  ROW_NUMBER() OVER (PARTITION BY jobs.job_id ORDER BY jobs.creation_time) AS job_id_row_number,
  jobs.job_id,
  jobs.creation_time,
  tag,
FROM jobs AS jobs
LEFT JOIN UNNEST(jobs.tags) AS tag

    columns:
      - name: job_id_row_number
        meta:
          dimension:
            type: number
            label: job_id_row_number

lightdash-bot · 2025-01-20T16:11:25Z

This issue was mentioned by a user in slack: https://lightdash-community.slack.com/archives/C067S48BHRP/p1737388379511399?thread_ts=1737388379.511399&cid=C067S48BHRP

lightdash-bot · 2025-05-14T17:45:19Z

This issue was mentioned by a user in slack: https://lightdash.slack.com/archives/C08RYCQJJCV/p1747079419585859?thread_ts=1747079419.585859&cid=C08RYCQJJCV

TuringLovesDeathMetal added the ✨ feature-request Request for a new feature or functionality label Jan 30, 2023

ZeRego added the ⚙️ backend This issue requires chang 8000 es to the backend label Jan 31, 2023

stale bot added the stale Issue hasn't been updated for 60 days label Apr 1, 2023

stale bot removed the stale Issue hasn't been updated for 60 days label Apr 20, 2023

stale bot added the stale Issue hasn't been updated for 60 days label Jun 19, 2023

stale bot removed the stale Issue hasn't been updated for 60 days label Jan 19, 2024

yu-iskw mentioned this issue Jul 17, 2024

Support ARRAY data type #10742

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

I want to add a metric type for `sum distinct` #4329

I want to add a metric type for `sum distinct` #4329

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

I want to add a metric type for sum distinct #4329

I want to add a metric type for sum distinct #4329

Comments

Uh oh!

Description

Acceptance criteria

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

TL;DR

Background

What I tried

Step 1: Basic Implementation

Step 2: Casting to NUMERIC

Step 3: Using smaller hash values

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

A workaround to virtually calculate sum distinct count in Lightdash

Uh oh!

Uh oh!

Uh oh!

I want to add a metric type for `sum distinct` #4329

I want to add a metric type for `sum distinct` #4329