
# RFC: a detailed design for create connection #35


Merged 4 commits into main on Aug 14, 2023

Conversation

tabVersion (Contributor):

#### AWS Private Link

An AWS Private Link establishes a private connection between a VPC and a service hosted on AWS. It is a secure and scalable way to connect to AWS services from within a VPC. The following fields are required to create an AWS Private Link connection:
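For concreteness, here is a sketch of what such a statement could look like, borrowing the field names that emerge later in this thread; the connection name and placeholder values are illustrative, and the RFC's exact field list is not reproduced here:

```sql
CREATE CONNECTION demo_privatelink WITH (
    type = 'private_link',
    provider = 'aws',
    service.name = 'xxx',                          -- the endpoint service to connect to
    availability.zones = '["az1", "az2", "az3"]'   -- AZs in which to create the VPC endpoint
);
```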
Contributor:

IIUC, Private Link is orthogonal to the service being connected to, and it only makes sense when used together with a source/sink definition, right?

PS. I guess you borrowed the idea from https://materialize.com/docs/sql/create-connection/

Contributor Author:

Yes. Private Link provides a way for the service in our VPC to access resources in another VPC, but it cannot be reversed.


##### Internal table for AWS Private Link

The `rw_aws_private_link` table stores the information of the AWS Private Link connection. The following fields are stored in the table:
Contributor:

Metadata is stored in etcd with protobuf encoding, rather than relational tables.

Contributor Author:

Then implementing `SHOW CONNECTIONS` should be enough.
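A minimal sketch of what that command might look like; the output columns are an assumption, not part of this RFC:

```sql
SHOW CONNECTIONS;
-- hypothetical output:
--        name       |     type     |            properties
-- ------------------+--------------+-----------------------------------
--  demo_privatelink | private_link | provider=aws, service.name=xxx, ...
```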


```sql
-- create source with connection
CREATE SOURCE {{ source_name }} ( {{ field }} = {{ value }}, ... )
FROM CONNECTION {{ connection_name }}
```
Contributor:

`WITH CONNECTION` may be better than `FROM/TO CONNECTION` because:

  1. It's consistent between source & sink.
  2. It avoids the potential ambiguity of the word `FROM`.

Contributor Author:

The syntax is the main part to be discussed. Upvote for this proposal.
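Under this proposal, the statements would read roughly as follows; a sketch only, with the sink form assumed to mirror the source form:

```sql
-- create source with connection
CREATE SOURCE {{ source_name }} ( {{ field }} = {{ value }}, ... )
WITH CONNECTION {{ connection_name }}

-- create sink with connection (assuming a symmetric form)
CREATE SINK {{ sink_name }} FROM {{ materialized_view }} ( {{ field }} = {{ value }}, ... )
WITH CONNECTION {{ connection_name }}
```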


## Future possibilities

* store some sensitive information, e.g. passwords and SSL keys, in an encrypted way (use `CREATE SECRET`); see the sketch below
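A sketch of how secrets might compose with connections. The syntax below is hypothetical (Materialize's `CREATE SECRET` is the inspiration); nothing here is decided by this RFC:

```sql
-- hypothetical syntax
CREATE SECRET kafka_password AS 'some-password';

CREATE CONNECTION demo_connection WITH (
    connection_type = 'kafka',
    properties.bootstrap.server = 'ip1:port1,ip2:port2',
    properties.sasl.password = SECRET kafka_password  -- a reference, not plaintext
);
```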
Contributor:

Question: in Flink, if you connect to a catalog, e.g. a JDBC database, you also automatically get the schemas of the tables in that database. Will we support this? Does Materialize support this?

fuyufjh (Contributor) commented Jan 26, 2023:

e.g.

```sql
CREATE CATALOG my_catalog WITH(
    'type' = 'jdbc',
    'default-database' = '...',
    'username' = '...',
    'password' = '...',
    'base-url' = '...'
);

USE CATALOG my_catalog;

SHOW TABLES;
-- Will print a list of tables

-- OR: SHOW TABLES FROM my_catalog
-- which does not require `USE CATALOG` first.

DESCRIBE TABLE foo;
-- Will print the schema of that table.
```

Contributor Author:

I don't think it is related to connection, which specifies the parameters used when connecting to external systems.
The feature described above needs further investigation.


## Motivation

A connection is used to describe how to connect to an external system that users want to read data from. Once created, a connection is reusable across `CREATE SOURCE` and `CREATE SINK` statements. This RFC proposes a new `CREATE CONNECTION` statement to create a connection.
Contributor:

Any stronger motivation? I mean, besides the convenience it brings, is there anything that was impossible but becomes doable with it?

Contributor Author:

There is no proper abstraction other than connection to hook private link into the existing reader 🤣
@wyhyhyhyh asked for this feature last quarter.

fuyufjh (Contributor) commented Jan 26, 2023:

Emmm, they ("private link" and "connection") look independent of each other. "Connection" here provides not only the network access to the VPC but also the credentials needed to connect to a specific data source/sink.

I mean, since you are proposing to allow users to write AWS VPC credentials along with the Kafka/Kinesis properties, e.g.

```sql
CREATE CONNECTION demo_connection
  with (
    connection_type = 'kinesis',
    aws.region='user_test_topic',
    endpoint='172.10.1.1:9090,172.10.1.2:9090',
    aws.credentials.role.arn='arn:aws-cn:iam::602389639824:role/demo_role',
    aws.credentials.role.external_id='demo_external_id',
   ...
  )
```

Then, why not just write AWS VPC credentials within CREATE SOURCE:

```sql
CREATE SOURCE demo_source
  with (
    connector_type = 'kinesis',
    aws.region='user_test_topic',
    endpoint='172.10.1.1:9090,172.10.1.2:9090',
    aws.credentials.role.arn='arn:aws-cn:iam::602389639824:role/demo_role',
    aws.credentials.role.external_id='demo_external_id',
   ...
  )
```

Disclaimer: I am not arguing this syntax is better. I am saying the proposed solution does not seem to solve the problem stated in the motivation.

|Field|Required|Description|
|---|---|---|
|`aws.credentials.role.arn`|Optional|The Amazon Resource Name (ARN) of the role to assume.|
|`aws.credentials.role.external_id`|Optional|The [external id](https://aws.amazon.com/blogs/security/how-to-use-external-id-when-granting-access-to-your-aws-resources/) used to authorize access to third-party resources.|

**Notice**: RisingWave will not check whether the connection to Kinesis is valid.
Contributor:

Reason?

Contributor Author:

In `CREATE CONNECTION`, we don't specify a topic/stream. We may not have access to list topics globally, and there is no proper API for the validation.

Contributor Author:

Adding a default topic for validation is acceptable, but there are concerns about misuse; see the sketch below.
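A sketch of what requiring a validation topic could look like; the `validation.topic` option name is hypothetical:

```sql
CREATE CONNECTION demo_connection WITH (
    connection_type = 'kafka',
    properties.bootstrap.server = 'ip1:port1,ip2:port2',
    validation.topic = 'healthcheck'  -- hypothetical: used only to probe the brokers at creation time
);
```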

hzxa21 left a comment:

I am also a little bit skeptical about the benefits brought by this RFC. IMO, the benefits can be:

  1. It provides a way to trigger the creation and approval process for AWS private link.
  2. It makes it easier to create multiple sources/sinks on the same external system, since the user just needs to fill in the cluster information once.

Several questions came to mind when reading this RFC:

  • As mentioned in this RFC, the behaviors are different for different types of connection. For example, we trigger the approval process and validate the connection for AWS Private Link, while for other connections we just store the information without validation. This might complicate things and makes me think this feature is more specific to AWS Private Link, not other systems.
  • If this feature is more specific to AWS Private Link, it looks like a cloud feature instead of a kernel/SQL feature. Have we considered triggering the creation and approval process in the cloud service/portal instead of via SQL? I am not sure whether this is doable, but it can simplify the kernel implementation as well as the SQL syntax.
  • Will we still support the current CREATE SOURCE/SINK syntax, in which we provide the cluster information in the WITH clause? In other words, can users still create a source/sink without creating a connection first? If yes, it may be a burden to implement a new source/sink. If not, users may get confused.

StrikeW (Contributor) commented Jan 18, 2023:
> If this feature is more specific to AWS Private Link, it looks like a cloud feature instead of a kernel/SQL feature. Have we considered triggering the creation and approval process in the cloud service/portal instead of via SQL? I am not sure whether this is doable, but it can simplify the kernel implementation as well as the SQL syntax.

Here is the background for implementing private link support on the kernel side: https://www.notion.so/risingwave-labs/RFC-Using-AWS-PrivateLink-to-Connect-a-Kafka-Instance-7b33defa8af14caab4122fb6f06a5cb9
cc @mikechesterwang @Nebulazhang

tabVersion (Contributor Author) commented Jan 18, 2023:

> As mentioned in this RFC, the behaviors are different for different types of connection. For example, we trigger the approval process and validate the connection for AWS Private Link, while for other connections we just store the information without validation. This might complicate things and makes me think this feature is more specific to AWS Private Link, not other systems.

I want to validate the connection in this step, but we may not find an API for validating only the brokers without a specific topic. We can require a topic here for validation.

> Have we considered triggering the creation and approval process in the cloud service/portal instead of via SQL?

The user's Kafka cluster is deployed in different VPCs, and the client side must connect to the broker's leader node. If the client's request address is wrong, the contacted broker will return the correct broker's address within its VPC. This requires that the source/sink can connect to all brokers and can rewrite the brokers' addresses.
We would like to make this feature available to open-source users rather than directing them to connect via VPC peering.

> Will we still support the current CREATE SOURCE/SINK syntax, in which we provide the cluster information in the WITH clause?

Yes, users can still create a source/sink without creating a connection. If they provide both, the options in the WITH clause will be used.

> If yes, it may be a burden to implement a new source/sink.

For Kafka, yes, we may support private link for it. But for others, no; we only treat them as a hashmap and try to fetch the missing fields.
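To illustrate the precedence described above, a sketch that combines the RFC's FROM CONNECTION clause with a WITH clause; the exact way the two compose is an assumption:

```sql
CREATE CONNECTION demo_connection WITH (
    connection_type = 'kafka',
    properties.bootstrap.server = 'ip1:port1,ip2:port2'
);

CREATE SOURCE demo_source ( topic = 'demo' )
FROM CONNECTION demo_connection
WITH (
    -- overrides the same-named option stored in the connection
    properties.bootstrap.server = 'ip3:port3'
);
```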

fuyufjh (Contributor) commented Jan 26, 2023:

> If yes, it may be a burden to implement a new source/sink.

> For Kafka, yes, we may support private link for it. But for others, no; we only treat them as a hashmap and try to fetch the missing fields.

I guess Patrick's idea is to add these `aws.*` fields into other connectors' properties, like Kinesis (example in #35 (comment)). In this way, they will also support AWS Private Link, right?

StrikeW (Contributor) commented Jan 27, 2023:

> If yes, it may be a burden to implement a new source/sink.

> For Kafka, yes, we may support private link for it. But for others, no; we only treat them as a hashmap and try to fetch the missing fields.

> I guess Patrick's idea is to add these `aws.*` fields into other connectors' properties, like Kinesis (example in #35 (comment)). In this way, they will also support AWS Private Link, right?

IMO, the RFC introduces a new concept, Connection, to users. But the goals it wants to achieve (e.g. private link, and reusability across CREATE SOURCE and CREATE SINK statements) can also be met by the existing CREATE SOURCE/CREATE SINK statements (e.g. by specifying those aws.xx fields in the WITH clause). So I think it is important to clarify the motivation for introducing Connection. cc @neverchanje

tabVersion (Contributor Author) commented Jan 30, 2023:

Connection should be reusable; private link is only one of its functions, and connection provides the basis for creating a source/sink with a secret key later. Currently, the connection is not checked because we haven't found a suitable API for it; we may fix this in future designs.
Also, this level of abstraction is necessary: specifying the private link in the WITH clause is too complicated and requires matching with the Kafka broker addresses one by one. But this can be left for later discussion.

The core of the current effort is to support private links, and in the short term I support providing this capability via:

```sql
create source/sink {{ name }} ... with (
  connector = 'kafka',
  properties.bootstrap.server = 'ip1:port1,ip2:port2',
  private.links = '[{"service_name": "xxx", "availability_zones": ["a", "b", "c"], "port": 8080}, {...}, {...}]',
  ...
)
```

The problem is that compute nodes do not have the private link specifications, which requires an RPC back to the meta node.

hzxa21 commented Feb 3, 2023:

> If yes, it may be a burden to implement a new source/sink.

> For Kafka, yes, we may support private link for it. But for others, no; we only treat them as a hashmap and try to fetch the missing fields.

> I guess Patrick's idea is to add these `aws.*` fields into other connectors' properties, like Kinesis (example in #35 (comment)). In this way, they will also support AWS Private Link, right?

Exactly. I am a little bit conservative about introducing a new SQL concept to kernel users, since now we need to teach them what CONNECTION is, explain its relationship with SOURCE/SINK, and instruct them that a SOURCE/SINK can be created with or without a CONNECTION. For cloud users, it may be fine because things can be hidden underneath.

If we do think introducing CONNECTION is necessary, I suggest we disallow CREATE SOURCE/SINK without CREATE CONNECTION to make the semantics clearer.

tabVersion (Contributor Author) commented Mar 14, 2023:

New connection syntax:

```sql
create connection connection_name with (
    type = 'private_link',
    provider = 'aws',
    service.name = 'xxx',
    availability.zones = '[{"az": ["a", "b", "c"], "port": 8080}, {...}, {...}]'
);
```

StrikeW (Contributor) commented Mar 15, 2023:

> New connection syntax:
>
> ```sql
> create connection connection_name with (
>     type = 'private_link',
>     provider = 'aws',
>     service.name = 'xxx',
>     availability.zones = '[{"az": ["a", "b", "c"], "port": 8080}, {...}, {...}]'
> );
> ```

```sql
create source/sink {{ name }} ... with (
  connector = 'kafka',
  properties.bootstrap.server = 'ip1:port1,ip2:port2',
  private.links = 'connection_name',
  ...
);
```

We force users to create a connection before creating a source that uses private link.

Having revisited the concepts of AWS PrivateLink, I think the above syntax can be improved.
The private link connection to be created is a VPC endpoint for accessing the endpoint service provided by the user. We only need a service name and AZs to create the endpoint; the ports are the listening ports of the endpoint service, which can change. The VPC endpoint doesn't need to be aware of those ports, since we will use a DNS name to access the endpoint service.

So I suggest putting the target ports into the CREATE SOURCE statement, making the mapping between the source broker address and the target AZ and port explicit.

```sql
create connection connection_name with (
    type = 'private_link',
    provider = 'aws',
    service.name = 'xxx',
    availability.zones = '["az1", "az2", "az3"]'
);
```

```sql
create source/sink {{ name }} ... with (
  connector = 'kafka',
  properties.bootstrap.server = 'ip1:port1,ip2:port2',
  privatelink.name = 'connection_name',
  privatelink.targets = '[{"az": "az1", "port": 9001}, {"az": "az2", "port": 9002}]',
  ...
);
```

For example, traffic for broker ip1:port1 will be routed to listening port 9001 of the endpoint service in AZ1.
cc @tabVersion @fuyufjh @mikechesterwang

fuyufjh merged commit ae651f7 into main on Aug 14, 2023.