-
Notifications
You must be signed in to change notification settings - Fork 646
feat: add support for rw_uint256 data type #21984
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Hi @chriswessels, thanks very much for the contribution
Could you add a few source/sink test cases in https://github.com/risingwavelabs/risingwave/tree/main/e2e_test/source_inline and https://github.com/risingwavelabs/risingwave/tree/main/e2e_test/sink? |
Thanks for the quick engagement here @lmatz! I've pushed additional testing of the type (and fixed the CAST_TABLE). Currently, the tests reveal an issue with Problem: The
8000
Edit: Previous suggested cause (memcomparable) has been ruled out. Edit: This has been resolved. |
UInt256 Implementation Update1. ORDER BY Correctness for Large UInt256 ValuesProblem: UInt256 values >= 2^255 were sorting incorrectly. The root cause was the Arrow conversion using Solution: Implemented a dual representation approach:
Why FixedSizeBinary?
This ensures correct byte-wise comparison for sorting while preserving compatibility with external systems. 2. Missing Arithmetic OperatorsAdded comprehensive arithmetic support between UInt256 and signed integers (int2, int4, int8):
Note: Division operators were already implemented in previous commits. 3. Sink Connector SupportFixed JSON encoder to properly handle Int256/UInt256 scalars by converting them to string representation, ensuring compatibility with external systems. Technical DetailsArrow Conversion Changes// Before (incorrect for large values)
fn uint256_to_arrow(&self, array: &UInt256Array) -> Result<ArrayRef, ArrayError> {
Ok(Arc::new(Decimal256Array::from(array)))
}
// After (correct unsigned ordering)
fn uint256_to_arrow(&self, array: &UInt256Array) -> Result<ArrayRef, ArrayError> {
// Convert to FixedSizeBinary(32) with big-endian bytes
let mut builder = FixedSizeBinaryBuilder::with_capacity(array.len(), 32);
for value in array.iter() {
match value {
Some(uint256) => builder.append_value(uint256.to_be_bytes())?,
None => builder.append_null(),
}
}
Ok(Arc::new(builder.finish()))
} Cast Table ImplicationsThe conservative cast table (requiring explicit casts from signed to unsigned) remains unchanged. This design choice prevents runtime failures by catching type mismatches at planning time. Testing
|
7381f31
to
4036ea1
Compare
5674337
to
8b9fc26
Compare
Implements 256-bit unsigned integer type using ethnum::u256 library. - Add UINT256 to protobuf data type enum - Implement ScalarImpl, ScalarRef, and ScalarRefImpl for UInt256 - Add UInt256Array implementation with arrow compatibility - Support serialization/deserialization in value encoding - Add hash key support for uint256 values - Include test utilities for generating random uint256 values - Extend type macros to support uint256 throughout the codebase This forms the foundation for uint256 support in RisingWave's type system.
- Map uint256 to Arrow Decimal256Array for data interchange - Add iceberg compatibility mapping for uint256 type - Support uint256 literals in frontend expression system - Extend data type utilities to handle uint256 - Add JSON conversion support for uint256 values - Update index selection rule to include uint256 type Enables efficient data exchange with external systems through Arrow format.
- BigQuery: map uint256 to BIGNUMERIC type - ClickHouse: use String representation for uint256 - Doris: encode as LARGEINT string representation - DynamoDB: serialize as string attribute - SQL Server: use VARCHAR(78) for uint256 values - StarRocks: encode as LARGEINT string - Add uint256 support in Avro, JSON, and Proto encoders - Include uint256 in Debezium JSON formatter - Extend MySQL and PostgreSQL parsers for uint256 compatibility Ensures uint256 values can be properly exported to various sink systems.
- Add binary arithmetic operators: +, -, *, /, % - Implement comparison operators: <, <=, >, >=, =, <> - Extend expression type macros to support uint256 - Add uint256 literal parsing in expression binder - Register uint256 in expression type system Enables arithmetic operations and comparisons on uint256 values in SQL queries.
- Implement casts from numeric types to uint256 - Add casts from uint256 to int256 and decimal - Support string to uint256 parsing - Update cast matrix with Implicit, Assign, and Explicit contexts - Handle overflow detection in numeric conversions Allows seamless type conversions between uint256 and other numeric types.
- Add sum aggregation for uint256 values - Implement min/max aggregates with proper state handling - Support statistical aggregates: avg, stddev_pop, stddev_samp, var_pop, var_samp - All statistical functions use float8 output type Enables aggregation operations on uint256 columns in streaming and batch queries.
- PostgreSQL: convert numeric types to uint256 with validation - JSON: parse uint256 from numeric and string representations - Add error handling for invalid uint256 conversions - Support list parsing for uint256 arrays Enables ingestion of uint256 values from PostgreSQL and JSON sources.
- Assign OID 1305 for rw_uint256 base type - Assign OID 1306 for _rw_uint256 array type - Register types in pg_type system catalog - Add proper I/O functions for PostgreSQL compatibility Enables PostgreSQL clients to properly recognize and handle uint256 types.
- Test DDL operations: CREATE TABLE, ALTER TABLE with uint256 - Verify DML operations: INSERT, UPDATE, DELETE - Test arithmetic operations and overflow behavior - Validate aggregate functions: sum, min, max, avg - Test type casting from numeric types and strings - Verify connector integration with PostgreSQL and JSON sources - Add boundary value tests for uint256 range Ensures uint256 type works correctly across all SQL operations.
- Document complete uint256 type implementation plan - Mark all three phases as complete with status indicators - List known limitations (hex literals, u512 accumulator, bitwise ops) - Include implementation details and file references - Specify PostgreSQL OID assignments (1305/1306) This specification serves as both design document and implementation guide for the uint256 feature in RisingWave.
- Add SQL usage examples for DDL, DML, and queries - Document connector integration patterns for PostgreSQL, JSON, and Avro - Provide sink configuration examples for major systems - Include common patterns like Wei to Ether conversion - Document limitations and performance considerations - Add best practices for handling overflow and external system integration The user guide provides practical examples for blockchain and DeFi use cases where uint256 is commonly needed.
Add uint256 test coverage matching existing int256 tests: - Batch type tests with arithmetic, comparisons, and aggregations - PostgreSQL catalog integration (OID 1305, casting rules) - CDC source tests for PostgreSQL NUMERIC to uint256 conversion - Sink tests for PostgreSQL and MongoDB - JSON serialization tests - Table-valued function tests - Edge case handling (overflow, underflow, NULL values)
Add test coverage for rw_int256 and rw_uint256 types in PostgreSQL sink integration test. Tests include: - Creating tables with numeric256 columns - Inserting various test values including edge cases - Validating data is correctly persisted to PostgreSQL NUMERIC(78,0) - Testing NULL handling for out-of-range uint256 values
Fix failing e2e tests by adding explicit type casts (::rw_uint256) to all integer literals in INSERT statements. The uint256 type requires explicit casting from integer literals in Assign context.
…mparable encoding The previous UInt256Ref::memcmp_serialize implementation used tuple serialization (hi, lo) which failed to preserve correct 256-bit ordering semantics. Large values requiring the high word were incorrectly sorted before smaller 64-bit values. Changes: - Replace tuple serialization with byte-by-byte big-endian serialization - Use serialize_u8() for each byte to ensure raw byte output without length tags - Update deserializer to read 32 bytes sequentially using u8::deserialize() - Add serde::ser::Serializer trait import for serialize_u8 method This ensures lexicographic ordering of serialized bytes matches numerical ordering of 256-bit unsigned integers, fixing ORDER BY queries for rw_uint256. Note: Tests still failing - may require further investigation into memcomparable encoding behavior or alternative serialization strategies.
…rators - Make cast table more conservative by requiring explicit casts for signed-to-unsigned conversions to prevent runtime failures - Add specialized division operators for uint256 with signed integers (uint256_div_int, int_div_uint256) - Fix UInt256 memcmp serialization to use proper big-endian byte ordering - Update tests to use numeric literals instead of string literals for consistency - Remove unimplemented cast tests and fix arithmetic expectations - Delete redundant basic uint256 test file
…c operators - Fix ORDER BY sorting issue for uint256 values >= 2^255 by changing Arrow conversion from signed Decimal256Array to FixedSizeBinary(32) with big-endian encoding. This ensures correct unsigned integer ordering. - Implement dual approach: use FixedSizeBinary internally for correct sorting while maintaining string representation for external systems (Iceberg, etc.) - Add missing arithmetic operators for uint256 with signed integers: - add(uint256, int*) and add(int*, uint256) - subtract(uint256, int*) and subtract(int*, uint256) - multiply(uint256, int*) and multiply(int*, uint256) - modulus(uint256, int*) and modulus(int*, uint256) - Fix JSON sink encoder to handle Int256/UInt256 scalars by converting to strings - Update tests to use explicit ORDER BY instead of rowsort for correct numeric ordering and add explicit casts for integer literals due to conservative cast table This ensures uint256 values sort correctly regardless of magnitude while maintaining compatibility with external systems.
This commit updates various catalog tests to account for the addition of int256/uint256 types and their array variants, and fixes a cast table alignment issue. Changes: 1. Updated pg_type tests to include _rw_int256 and _rw_uint256 array types - atlasgo.slt.part: Added 2 NULL and 2 'b' entries for array types - issue_10177.slt.part: Added array type entries with OIDs 1302 and 1306 - pg_type.slt.part: Added array type definitions 2. Updated pg_cast test to reflect new cast entries: - Added explicit casts from int2/int4/int8 to uint256 (entries 9, 18, 27) - Added bidirectional explicit casts between int256 and uint256 (entries 89, 92) - Removed implicit casts to uint256 from the second query (correctly explicit) 3. Fixed cast table alignment issue in cast.rs: - Removed trailing space from Int64 cast string to fix column alignment - This corrects the int8→varchar cast to be 'a' (assign) as intended The cast table fix ensures that int8→varchar remains an assignment cast, which is essential for common SQL patterns like inserting integers into text columns.
- Add UINT256_VALUE to allowed target types for numeric/decimal columns in PostgreSQL, MySQL, and SQL Server CDC validators - Enables mapping database numeric types to uint256 without validation errors - Prevents 'Incompatible data type' errors when creating CDC sources with uint256 columns
Add TryFrom<Decimal> implementations for Int256 and UInt256 types to support casting from numeric/decimal values. The implementation rounds decimals to integers and handles special cases like NaN/Infinity with appropriate errors. Also register the cast functions in the expression system to enable SQL casts.
Change test to use VARCHAR columns instead of NUMERIC(78,0) to store large 256-bit values. This avoids the rust_decimal deserialization limit of ~28 digits when reading from PostgreSQL.
- Add validation to allow int256/uint256 types to be written to PostgreSQL numeric columns - Implement ScalarAdapter conversion for UInt256 to PgNumeric - Support arrays of int256/uint256 to numeric[] conversion - Update type validation logic to handle the new type mappings
Use explicit string literal casting for large numeric values that exceed the parser's numeric literal range. This prevents overflow errors when inserting int256/uint256 values.
- Use sink_test database consistently throughout the test - Fix psql commands to match the pattern used in the rest of the file - Add proper output verification for the numeric256 sink test
The test now verifies the actual int256 and uint256 values are correctly written to PostgreSQL's numeric columns, not just the row descriptions. This ensures the full precision values are preserved through the sink.
8b9fc26
to
210d167
Compare
Hey @lmatz! I've added further test coverage and the full suite is now passing. I've updated the PR description with a fresh overview of the changes. |
I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.
Last updated: May 31st 2025
What's changed and what's your intention?
This pull request introduces a new
rw_uint256
data type to RisingWave, enabling the storage and manipulation of unsigned 256-bit integers. This enhancement is primarily aimed at supporting use cases involving very large non-negative numbers, such as those found in blockchain data (e.g., token amounts, transaction IDs), high-precision counters, and other scenarios where standard integer types are insufficient.Key Goals:
rw_uint256
.Detailed Changes and Considerations:
1. Core Type System & Representation:
DataType
Variant:DataType::UInt256
has been added to represent the unsigned 256-bit integer type.ScalarImpl::UInt256
andScalarRefImpl::UInt256
are introduced for handling individualrw_uint256
values.UInt256
struct is implemented, wrappingethnum::u256
(from theethnum
crate, already a dependency) to provide the core arithmetic and storage capabilities, similar to howInt256
utilizesethnum::i256
.UInt256Array
andUInt256ArrayBuilder
are now available for efficient, vectorized processing ofrw_uint256
data.data.proto
file has been updated to includeUINT256
in theDataType.TypeName
andArrayType
enums, ensuring compatibility for serialization and communication within the distributed system.rw_uint256
values are serialized as 32-byte little-endian byte arrays, aligning with the existingInt256
serialization for consistency.ORDER BY
clauses, join keys),rw_uint256
values are serialized using big-endian byte order to ensure correct lexicographical comparison.2. SQL Surface and Semantics:
rw_uint256
in DDL statements (e.g.,CREATE TABLE example (id INT, large_unsigned_val rw_uint256);
).rw_uint256
values are primarily input as string literals (e.g.,'123456789...'::rw_uint256
).'0x...'::rw_uint256
) are also supported for input, providing flexibility for blockchain-related data.+
,-
,*
,/
,%
) are implemented forrw_uint256
operands.[0, 2^256 - 1]
range will raise a "Numeric value out of range" error.x / 0
orx % 0
) will result in a "division by zero" error.=
,!=
,<
,<=
,>
,>=
) are implemented and behave as expected for unsigned integers.-
) is not directly applicable torw_uint256
as it's an unsigned type. AttemptingSELECT -some_uint256_column
will result in a parser error. Theneg(uint256)
function is provided, which returns 0 if the input is 0, and errors for any other positive input, reflecting the impossibility of representing a negative result.abs()
function: Returns the inputrw_uint256
value, as unsigned integers are always non-negative.rw_uint256
is used in operations with smaller integer types (SMALLINT
,INT
,BIGINT
), the result is promoted torw_uint256
. An error will occur if the smaller integer is negative.rw_uint256
andFLOAT8
will promote the result toFLOAT8
.rw_uint256
andDECIMAL
will promote the result toDECIMAL
.pg_catalog
) Updates:pg_type
: New entries have been added forrw_uint256
(OID 1305) and its array type_rw_uint256
(OID 1306). This includes metadata liketypinput
,typoutput
,typbasetype
,typlen
,typalign
, andtypcategory
.pg_cast
: The catalog has been updated to include definitions for all supported implicit (i
) and explicit (e
) casts involvingrw_uint256
and other compatible types (e.g.,int2
,int4
,int8
,numeric
,varchar
,float8
).3. Casting Behavior:
rw_uint256
:INT2
,INT4
,INT8
,SERIAL
: Supported (implicit and explicit). Raises an error if the source value is negative.VARCHAR
: Supported (explicit). Parses the string as a decimal or hexadecimal number. Errors on invalid format or if the value is outside the representable range ofrw_uint256
.DECIMAL
: Supported (explicit). The decimal value is rounded to the nearest integer. Errors if the value is negative or exceeds therw_uint256
range.INT256
: Explicit cast; errors if theINT256
value is negative.rw_uint256
:VARCHAR
: Converts therw_uint256
to its string representation.INT256
: Explicit cast. Errors if therw_uint256
value is too large to be represented as a signed 256-bit integer (i.e., if the most significant bit is set, which would indicate a negative number inInt256
).FLOAT8
: Converts to a double-precision floating-point number. Potential precision loss may occur for very largerw_uint256
values.DECIMAL
: Converts to a decimal number.INT2
,INT4
,INT8
: Explicit casts. Errors if therw_uint256
value is too large to fit within the target integer type's range.4. Aggregate Functions:
COUNT(rw_uint256)
: Standard count behavior.SUM(rw_uint256)
: Returnsrw_uint256
. Internally uses a wider accumulator type (currentlyu512
) to mitigate overflow during intermediate calculations. An error will be raised if the final sum exceedsUInt256::MAX
.AVG(rw_uint256)
: ReturnsFLOAT8
.MIN(rw_uint256)
,MAX(rw_uint256)
: Returnrw_uint256
.STDDEV_POP(rw_uint256)
,STDDEV_SAMP(rw_uint256)
,VAR_POP(rw_uint256)
,VAR_SAMP(rw_uint256)
: All returnFLOAT8
.5. Connector Interactions:
rw_uint256
is mapped to PostgreSQL'sNUMERIC
type.PostgresSink::type_derive()
andScalarAdapter
have been updated to handlerw_uint256
.DECIMAL
andNUMERIC
columns from these sources can now be correctly mapped torw_uint256
(parsed from string format).rw_uint256
values are represented as strings in JSON payloads (e.g.,"12345"
). This ensures compatibility with JSON parsers that might not handle numbers of this magnitude directly.rw_uint256
.rw_uint256
is represented asFixedSizeBinary(32)
with big-endian byte order. This allows for efficient storage and preserves correct sorting order.rw_uint256
columns are mapped toVARCHAR
(string) type. This ensures broader compatibility with tools and systems reading Parquet files, which may not have native support for 256-bit unsigned integers but can handle large numbers as strings.rw_uint256
in formats like Avro and Protobuf is generally not available.rw_uint256
values toVARCHAR
within theCREATE SINK
statement to ensure data integrity and avoid potential errors.6. Testing:
e2e_test/batch/basic/uint256.slt
ande2e_test/batch/types/uint256.slt.part
for comprehensive testing ofrw_uint256
functionalities.to_jsonb.slt.part
,pg_catalog.slt
,pg_type.slt
,pg_cast.slt
, and relevant connector tests (PostgreSQL, MySQL, SQL Server sources; PostgreSQL sink, Iceberg sink scenarios) to includerw_uint256
test cases.UInt256
type operations such as arithmetic, parsing, serialization, and boundary condition handling.7. Potential Failure Modes & Considerations:
rw_uint256
will result in errors.rw_uint256
value that exceeds the maximum value of a signed 256-bit integer torw_int256
will fail.INT2
,INT4
,INT8
) will fail if therw_uint256
value exceeds their respective maximums.rw_uint256
values to external systems. For sinks without direct support, explicit casting toVARCHAR
in the sink definition is the safest approach.This addition aims to provide a robust and flexible solution for handling large unsigned integers in RisingWave. We believe these changes will be valuable for users working with specific data domains.
Checklist
Documentation
Release note
✨ New Data Type:
rw_uint256
for Unsigned 256-bit IntegersRisingWave now supports the
rw_uint256
data type, designed for handling extremely large unsigned integer values, ranging from 0 up to 2^256 - 1. This is particularly beneficial for applications dealing with blockchain data, high-precision counters, or other scenarios requiring extended-range unsigned integers.Key Features and Behavior:
rw_uint256
.0x
) format.+
,-
,*
,/
,%
with overflow and division-by-zero checks.=
,!=
,<
,<=
,>
,>=
) work as expected.SUM
,AVG
(returnsFLOAT8
),MIN
,MAX
,COUNT
, and statistical aggregates likeSTDDEV_POP
are supported.SMALLINT
,INT
,BIGINT
),DECIMAL
, andVARCHAR
. Casting negative values will result in an error.VARCHAR
,FLOAT8
,DECIMAL
, and other integer types (with potential range errors).NUMERIC
type, facilitating data exchange via string representation.DECIMAL
andNUMERIC
columns can be read asrw_uint256
.rw_uint256
values are represented as strings.rw_uint256
columns will be mapped toVARCHAR
(string) type in the Parquet files. Internally, for Arrow processing,FixedSizeBinary(32)
is used.rw_uint256
may be limited. It's recommended to castrw_uint256
toVARCHAR
in yourCREATE SINK
statement for these connectors.Example Usage:
Use Cases:
NUMERIC
for whole numbers.Important Notes:
rw_uint256
strictly represents non-negative integers. Attempting to cast negative values torw_uint256
will result in an error.0
to2^256 - 1
range (e.g.,0::rw_uint256 - 1
) will cause an error.