
# RFC: a detailed design for create connection #35


Merged 4 commits into main on Aug 14, 2023

Conversation

tabVersion (Contributor):

#### AWS Private Link

An AWS Private Link establishes a private connection between a VPC and a service hosted on AWS. It is a secure and scalable way to connect to AWS services from within a VPC. The following fields are required to create an AWS Private Link connection:
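For concreteness, here is a sketch of what such a statement could look like, borrowing the field names that emerge later in this thread; the connection name and placeholder values are illustrative, and the RFC's exact field list is not reproduced here:

```sql
CREATE CONNECTION demo_privatelink WITH (
    type = 'private_link',
    provider = 'aws',
    service.name = 'xxx',                          -- the endpoint service to connect to
    availability.zones = '["az1", "az2", "az3"]'   -- AZs in which to create the VPC endpoint
);
```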
Contributor:

IIUC, Private Link is orthogonal to the service being connected to, and it only makes sense when used together with a source/sink definition, right?

PS. I guess you borrowed the idea from https://materialize.com/docs/sql/create-connection/

Contributor Author:

Yes. Private Link provides a way for the service in our VPC to access resources in another VPC, but it cannot be reversed.


##### Internal table for AWS Private Link

The `rw_aws_private_link` table stores the information of the AWS Private Link connection. The following fields are stored in the table:
Contributor:

Metadata is stored in etcd with protobuf encoding, rather than relational tables.

Contributor Author:

Then implementing `SHOW CONNECTIONS` should be enough.
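A minimal sketch of what that command might look like; the output columns are an assumption, not part of this RFC:

```sql
SHOW CONNECTIONS;
-- hypothetical output:
--        name       |     type     |            properties
-- ------------------+--------------+-----------------------------------
--  demo_privatelink | private_link | provider=aws, service.name=xxx, ...
```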


```sql
-- create source with connection
CREATE SOURCE {{ source_name }} ( {{ field }} = {{ value }}, ... )
FROM CONNECTION {{ connection_name }}
```
Contributor:

`WITH CONNECTION` may be better than `FROM/TO CONNECTION` because:

  1. It's consistent between source & sink.
  2. It avoids the potential ambiguity of the word `FROM`.

Contributor Author:

The syntax is the main part to be discussed. Upvote for this proposal.
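Under this proposal, the statements would read roughly as follows; a sketch only, with the sink form assumed to mirror the source form:

```sql
-- create source with connection
CREATE SOURCE {{ source_name }} ( {{ field }} = {{ value }}, ... )
WITH CONNECTION {{ connection_name }}

-- create sink with connection (assuming a symmetric form)
CREATE SINK {{ sink_name }} FROM {{ materialized_view }} ( {{ field }} = {{ value }}, ... )
WITH CONNECTION {{ connection_name }}
```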


## Future possibilities

* store some sensitive information, e.g. passwords and SSL keys, in an encrypted way (use `CREATE SECRET`); see the sketch below
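A sketch of how secrets might compose with connections. The syntax below is hypothetical (Materialize's `CREATE SECRET` is the inspiration); nothing here is decided by this RFC:

```sql
-- hypothetical syntax
CREATE SECRET kafka_password AS 'some-password';

CREATE CONNECTION demo_connection WITH (
    connection_type = 'kafka',
    properties.bootstrap.server = 'ip1:port1,ip2:port2',
    properties.sasl.password = SECRET kafka_password  -- a reference, not plaintext
);
```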
Contributor:

Question: in Flink, if you connect to a catalog, e.g. a JDBC database, you also automatically get the schemas of the tables in that database. Will we support this? Does Materialize support this?

fuyufjh (Contributor) commented Jan 26, 2023:

e.g.

```sql
CREATE CATALOG my_catalog WITH(
    'type' = 'jdbc',
    'default-database' = '...',
    'username' = '...',
    'password' = '...',
    'base-url' = '...'
);

USE CATALOG my_catalog;

SHOW TABLES;
-- Will print a list of tables

-- OR: SHOW TABLES FROM my_catalog
-- which does not require `USE CATALOG` first.

DESCRIBE TABLE foo;
-- Will print the schema of that table.
```

Contributor Author:

I don't think it is related to connection, which specifies the parameters used when connecting to external systems.
The feature described above needs further investigation.


## Motivation

A connection is used to describe how to connect to an external system that users want to read data from. Once created, a connection is reusable across `CREATE SOURCE` and `CREATE SINK` statements. This RFC proposes a new `CREATE CONNECTION` statement to create a connection.
Contributor:

Any stronger motivation? I mean, besides the convenience it brings, is there anything that was impossible but becomes doable with it?

Contributor Author:

There is no proper abstraction other than connection to hook private link into the existing reader 🤣
@wyhyhyhyh asked for this feature last quarter.

fuyufjh (Contributor) commented Jan 26, 2023:

Emmm, they ("private link" and "connection") look independent of each other. "Connection" here provides not only the network access to the VPC but also the credentials needed to connect to a specific data source/sink.

I mean, since you are proposing to allow users to write AWS VPC credentials along with the Kafka/Kinesis properties, e.g.

```sql
CREATE CONNECTION demo_connection
  with (
    connection_type = 'kinesis',
    aws.region='user_test_topic',
    endpoint='172.10.1.1:9090,172.10.1.2:9090',
    aws.credentials.role.arn='arn:aws-cn:iam::602389639824:role/demo_role',
    aws.credentials.role.external_id='demo_external_id',
   ...
  )
```

Then, why not just write AWS VPC credentials within CREATE SOURCE:

```sql
CREATE SOURCE demo_source
  with (
    connector_type = 'kinesis',
    aws.region='user_test_topic',
    endpoint='172.10.1.1:9090,172.10.1.2:9090',
    aws.credentials.role.arn='arn:aws-cn:iam::602389639824:role/demo_role',
    aws.credentials.role.external_id='demo_external_id',
   ...
  )
```

Disclaimer: I am not arguing this syntax is better. I am saying the proposed solution does not seem to solve the problem stated in the motivation.

|Field|Required|Description|
|---|---|---|
|`aws.credentials.role.arn`|Optional|The Amazon Resource Name (ARN) of the role to assume.|
|`aws.credentials.role.external_id`|Optional|The [external id](https://aws.amazon.com/blogs/security/how-to-use-external-id-when-granting-access-to-your-aws-resources/) used to authorize access to third-party resources.|

**Notice**: RisingWave will not check whether the connection to Kinesis is valid.
Contributor:

Reason?

Contributor Author:

In `CREATE CONNECTION`, we don't specify a topic/stream. We may not have access to list topics globally, and there is no proper API for the validation.

Contributor Author:

Adding a default topic for validation is acceptable, but there are concerns about misuse; see the sketch below.
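A sketch of what requiring a validation topic could look like; the `validation.topic` option name is hypothetical:

```sql
CREATE CONNECTION demo_connection WITH (
    connection_type = 'kafka',
    properties.bootstrap.server = 'ip1:port1,ip2:port2',
    validation.topic = 'healthcheck'  -- hypothetical: used only to probe the brokers at creation time
);
```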

hzxa21 left a comment:

I am also a little bit skeptical about the benefits brought by this RFC. IMO, the benefits can be:

  1. It provides a way to trigger the creation and approval process for AWS private link.
  2. It makes it easier to create multiple sources/sinks on the same external system, since the user just needs to fill in the cluster information once.

Several questions came to mind when reading this RFC:

  • As mentioned in this RFC, the behaviors are different for different types of connection. For example, we trigger the approval process and validate the connection for AWS Private Link, while for other connections we just store the information without validation. This might complicate things and makes me think this feature is more specific to AWS Private Link, not other systems.
  • If this feature is more specific to AWS Private Link, it looks like a cloud feature instead of a kernel/SQL feature. Have we considered triggering the creation and approval process in the cloud service/portal instead of via SQL? I am not sure whether this is doable, but it can simplify the kernel implementation as well as the SQL syntax.
  • Will we still support the current CREATE SOURCE/SINK syntax, in which we provide the cluster information in the WITH clause? In other words, can users still create a source/sink without creating a connection first? If yes, it may be a burden to implement a new source/sink. If not, users may get confused.

StrikeW (Contributor) commented Jan 18, 2023:
> If this feature is more specific to AWS Private Link, it looks like a cloud feature instead of a kernel/SQL feature. Have we considered triggering the creation and approval process in the cloud service/portal instead of via SQL? I am not sure whether this is doable, but it can simplify the kernel implementation as well as the SQL syntax.

Here is the background for implementing private link support on the kernel side: https://www.notion.so/risingwave-labs/RFC-Using-AWS-PrivateLink-to-Connect-a-Kafka-Instance-7b33defa8af14caab4122fb6f06a5cb9
cc @mikechesterwang @Nebulazhang

tabVersion (Contributor Author) commented Jan 18, 2023:

> As mentioned in this RFC, the behaviors are different for different types of connection. For example, we trigger the approval process and validate the connection for AWS Private Link, while for other connections we just store the information without validation. This might complicate things and makes me think this feature is more specific to AWS Private Link, not other systems.

I want to validate the connection in this step, but we may not find an API for validating only the brokers without a specific topic. We can require a topic here for validation.

> Have we considered triggering the creation and approval process in the cloud service/portal instead of via SQL?

The user's Kafka cluster is deployed in different VPCs, and the client side must connect to the broker's leader node. If the client's request address is wrong, the contacted broker will return the correct broker's address within its VPC. This requires that the source/sink can connect to all brokers and can rewrite the brokers' addresses.
We would like to make this feature available to open-source users rather than directing them to connect via VPC peering.

> Will we still support the current CREATE SOURCE/SINK syntax, in which we provide the cluster information in the WITH clause?

Yes, users can still create a source/sink without creating a connection. If they provide both, the options in the WITH clause will be used.

> If yes, it may be a burden to implement a new source/sink.

For Kafka, yes, we may support private link for it. But for others, no; we only treat them as a hashmap and try to fetch the missing fields.
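To illustrate the precedence described above, a sketch that combines the RFC's FROM CONNECTION clause with a WITH clause; the exact way the two compose is an assumption:

```sql
CREATE CONNECTION demo_connection WITH (
    connection_type = 'kafka',
    properties.bootstrap.server = 'ip1:port1,ip2:port2'
);

CREATE SOURCE demo_source ( topic = 'demo' )
FROM CONNECTION demo_connection
WITH (
    -- overrides the same-named option stored in the connection
    properties.bootstrap.server = 'ip3:port3'
);
```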

fuyufjh (Contributor) commented Jan 26, 2023:

> If yes, it may be a burden to implement a new source/sink.

> For Kafka, yes, we may support private link for it. But for others, no; we only treat them as a hashmap and try to fetch the missing fields.

I guess Patrick's idea is to add these `aws.*` fields into other connectors' properties, like Kinesis (example in #35 (comment)). In this way, they will also support AWS Private Link, right?

StrikeW (Contributor) commented Jan 27, 2023:

> If yes, it may be a burden to implement a new source/sink.

> For Kafka, yes, we may support private link for it. But for others, no; we only treat them as a hashmap and try to fetch the missing fields.

> I guess Patrick's idea is to add these `aws.*` fields into other connectors' properties, like Kinesis (example in #35 (comment)). In this way, they will also support AWS Private Link, right?

IMO, the RFC introduces a new concept, Connection, to users. But the goals it wants to achieve (e.g. private link, and reusability across CREATE SOURCE and CREATE SINK statements) can also be met by the existing CREATE SOURCE/CREATE SINK statements (e.g. by specifying those aws.xx fields in the WITH clause). So I think it is important to clarify the motivation for introducing Connection. cc @neverchanje

tabVersion (Contributor Author) commented Jan 30, 2023:

Connection should be reusable; private link is only one of its functions, and connection provides the basis for creating a source/sink with a secret key later. Currently, the connection is not checked because we haven't found a suitable API for it; we may fix this in future designs.
Also, this level of abstraction is necessary: specifying the private link in the WITH clause is too complicated and requires matching with the Kafka broker addresses one by one. But this can be left for later discussion.

The core of the current effort is to support private links, and in the short term I support providing this capability via:

```sql
create source/sink {{ name }} ... with (
  connector = 'kafka',
  properties.bootstrap.server = 'ip1:port1,ip2:port2',
  private.links = '[{"service_name": "xxx", "availability_zones": ["a", "b", "c"], "port": 8080}, {...}, {...}]',
  ...
)
```

The problem is that compute nodes do not have the private link specifications, which requires an RPC back to the meta node.

hzxa21 commented Feb 3, 2023:

> If yes, it may be a burden to implement a new source/sink.

> For Kafka, yes, we may support private link for it. But for others, no; we only treat them as a hashmap and try to fetch the missing fields.

> I guess Patrick's idea is to add these `aws.*` fields into other connectors' properties, like Kinesis (example in #35 (comment)). In this way, they will also support AWS Private Link, right?

Exactly. I am a little bit conservative about introducing a new SQL concept to kernel users, since now we need to teach them what CONNECTION is, explain its relationship with SOURCE/SINK, and instruct them that a SOURCE/SINK can be created with or without a CONNECTION. For cloud users, it may be fine because things can be hidden underneath.

If we do think introducing CONNECTION is necessary, I suggest we disallow CREATE SOURCE/SINK without CREATE CONNECTION to make the semantics clearer.

tabVersion (Contributor Author) commented Mar 14, 2023:

New connection syntax:

```sql
create connection connection_name with (
    type = 'private_link',
    provider = 'aws',
    service.name = 'xxx',
    availability.zones = '[{"az": ["a", "b", "c"], "port": 8080}, {...}, {...}]'
);
```

StrikeW (Contributor) commented Mar 15, 2023:

> New connection syntax:
>
> ```sql
> create connection connection_name with (
>     type = 'private_link',
>     provider = 'aws',
>     service.name = 'xxx',
>     availability.zones = '[{"az": ["a", "b", "c"], "port": 8080}, {...}, {...}]'
> );
> ```

```sql
create source/sink {{ name }} ... with (
  connector = 'kafka',
  properties.bootstrap.server = 'ip1:port1,ip2:port2',
  private.links = 'connection_name',
  ...
);
```

We force users to create a connection before creating a source that uses private link.

Having revisited the concepts of AWS PrivateLink, I think the above syntax can be improved.
The private link connection to be created is a VPC endpoint for accessing the endpoint service provided by the user. We only need a service name and AZs to create the endpoint; the ports are the listening ports of the endpoint service, which can change. The VPC endpoint doesn't need to be aware of those ports, since we will use a DNS name to access the endpoint service.

So I suggest putting the target ports into the CREATE SOURCE statement, making the mapping between the source broker address and the target AZ and port explicit.

```sql
create connection connection_name with (
    type = 'private_link',
    provider = 'aws',
    service.name = 'xxx',
    availability.zones = '["az1", "az2", "az3"]'
);
```

```sql
create source/sink {{ name }} ... with (
  connector = 'kafka',
  properties.bootstrap.server = 'ip1:port1,ip2:port2',
  privatelink.name = 'connection_name',
  privatelink.targets = '[{"az": "az1", "port": 9001}, {"az": "az2", "port": 9002}]',
  ...
);
```

For example, traffic for broker ip1:port1 will be routed to listening port 9001 of the endpoint service in AZ1.
cc @tabVersion @fuyufjh @mikechesterwang

fuyufjh merged commit ae651f7 into main on Aug 14, 2023.