RFC: a detailed design for create connection #35
#### AWS Private Link

An AWS Private Link establishes a private connection between a VPC and a service hosted on AWS. It is a secure and scalable way to connect to AWS services from within a VPC. The following fields are required to create an AWS Private Link connection:
IIUC, Private Link is orthogonal to the service being connected to, and it only makes sense when used together with a source/sink definition, right?

PS. I guess you borrowed the idea from https://materialize.com/docs/sql/create-connection/
Yes. A private link provides a way for a service in our VPC to access resources in another VPC, but it cannot be reversed.
rfcs/0035-create-connection.md
##### Internal table for AWS Private Link

The `rw_aws_private_link` table stores the information of the AWS Private Link connection. The following fields are stored in the table:
Metadata is stored in etcd with protobuf encoding, rather than relational tables.
Then implementing `SHOW CONNECTIONS` should be enough.
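For illustration, a rough sketch of what `SHOW CONNECTIONS` might return; the column set here is an assumption for the sake of the example, not part of the RFC:

```sql
SHOW CONNECTIONS;
-- Hypothetical output (columns are illustrative only):
--  name            | type         | provider | properties
-- -----------------+--------------+----------+------------------------
--  demo_connection | private_link | aws      | service.name=xxx, ...
```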
rfcs/0035-create-connection.md
```sql
-- create source with connection
CREATE SOURCE {{ source_name }} ( {{ field }} = {{ value }}, ... )
FROM CONNECTION {{ connection_name }}
```
`WITH CONNECTION` may be better than `FROM`/`TO CONNECTION` because:

- it is consistent between source & sink;
- it avoids the potential ambiguity of the word `FROM`.

See the sketch below for what this could look like.
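For illustration, a sketch of the proposed keyword for both statements; the sink form is an assumption extrapolated from the source syntax quoted above:

```sql
-- create source with connection
CREATE SOURCE {{ source_name }} ( {{ field }} = {{ value }}, ... )
WITH CONNECTION {{ connection_name }};

-- create sink with connection: the same keyword works for both, and it
-- avoids a second FROM after `CREATE SINK ... FROM <relation>`
CREATE SINK {{ sink_name }} FROM {{ materialized_view }}
WITH CONNECTION {{ connection_name }};
```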
The syntax is the main part to be discussed. Upvote for this proposal.
## Future possibilities

* store some sensitive information, e.g. passwords and SSL keys, in an encrypted way (use `CREATE SECRET`)
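A purely hypothetical sketch of that future possibility; neither the `CREATE SECRET` statement nor the `SECRET` reference below is specified in this RFC, and the shape is loosely modeled on Materialize's secrets:

```sql
-- store the password encrypted, instead of as plain text in the catalog
CREATE SECRET kafka_password AS 'xxx';

CREATE CONNECTION demo_connection WITH (
    connection_type = 'kafka',
    properties.bootstrap.server = 'ip1:port1,ip2:port2',
    properties.sasl.password = SECRET kafka_password
);
```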
Question: In Flink, if you connect to a catalog, e.g. a JDBC database, you also automatically get the schema of the tables in that database. Will we support this? Does Materialize support this?
e.g.

```sql
CREATE CATALOG my_catalog WITH (
    'type' = 'jdbc',
    'default-database' = '...',
    'username' = '...',
    'password' = '...',
    'base-url' = '...'
);

USE CATALOG my_catalog;

SHOW TABLES;
-- Will print a list of tables.
-- OR: SHOW TABLES FROM my_catalog, which does not require `USE CATALOG` first.

DESCRIBE TABLE foo;
-- Will print the schema of that table.
```
I don't think it is related to connection, which only specifies the parameters used when connecting to external systems. The feature described above needs further investigation.
## Motivation

A connection is used to describe how to connect to an external system that users want to read data from. Once created, a connection is reusable across `CREATE SOURCE` and `CREATE SINK` statements. This RFC proposes a new `CREATE CONNECTION` statement to create a connection.
Any stronger motivation? I mean, besides the convenience it brings, is there anything that was impossible before but becomes doable with it?
There is no proper abstraction other than connection for wiring a private link into the existing reader 🤣. @wyhyhyhyh has asked for this feature in the last Q.
Emmm, they ("private link" and "connection") look independent of each other. "Connection" here provides not only the network access to the VPC but also the very credentials needed to connect to a specific data source/sink.

I mean, since you are proposing to allow users to write AWS VPC credentials along with the Kafka/Kinesis properties, e.g.

```sql
CREATE CONNECTION demo_connection
WITH (
    connection_type = 'kinesis',
    aws.region = 'user_test_topic',
    endpoint = '172.10.1.1:9090,172.10.1.2:9090',
    aws.credentials.role.arn = 'arn:aws-cn:iam::602389639824:role/demo_role',
    aws.credentials.role.external_id = 'demo_external_id',
    ...
);
```

then why not just write the AWS VPC credentials within `CREATE SOURCE`:

```sql
CREATE SOURCE demo_source
WITH (
    connector_type = 'kinesis',
    aws.region = 'user_test_topic',
    endpoint = '172.10.1.1:9090,172.10.1.2:9090',
    aws.credentials.role.arn = 'arn:aws-cn:iam::602389639824:role/demo_role',
    aws.credentials.role.external_id = 'demo_external_id',
    ...
);
```

Disclaimer: I am not arguing that this syntax is better. I am saying that the proposed solution does not seem to address the problem stated in the motivation.
| Field | Required | Description |
|---|---|---|
| `aws.credentials.role.arn` | Optional | The Amazon Resource Name (ARN) of the role to assume. |
| `aws.credentials.role.external_id` | Optional | The [external id](https://aws.amazon.com/blogs/security/how-to-use-external-id-when-granting-access-to-your-aws-resources/) used to authorize access to third-party resources. |

**Notice**: RisingWave will not check whether the connection to Kinesis is valid.
Reason?
In `CREATE CONNECTION` we don't specify a topic/stream. We may not have access to list topics globally, and there is no proper API for the validation.
Adding a default topic to do the validation is acceptable, but I have concerns about misuse.
I am also a little bit skeptical about the benefits brought by this RFC. IMO, the benefits can be:

- It provides a way to trigger the creation and approval process for AWS Private Link.
- It makes it easier to create multiple sources/sinks on the same external system, since the user only needs to fill in the cluster information once.

Several questions come to mind when reading this RFC:

- As mentioned in this RFC, the behaviors differ across connection types. For example, we trigger the approval process and validate the connection for AWS Private Link, while for other connections we just store the information without validation. This might complicate things and makes me think this feature is more specific to AWS Private Link than to other systems.
- If this feature is specific to AWS Private Link, it looks like a cloud feature rather than a kernel/SQL feature. Have we considered triggering the creation and approval process in the cloud service/portal instead of via SQL? I am not sure whether this is doable, but it could simplify the kernel implementation as well as the SQL syntax.
- Will we still support the current `CREATE SOURCE`/`CREATE SINK` syntax, in which we provide the cluster information in the `WITH` clause? In other words, can a user still create a source/sink without creating a connection first? If yes, it may be a burden to implement a new source/sink. If not, users may get confused.
Here is the background for implementing private link support on the kernel side: https://www.notion.so/risingwave-labs/RFC-Using-AWS-PrivateLink-to-Connect-a-Kafka-Instance-7b33defa8af14caab4122fb6f06a5cb9
I want to validate the connection in this step, but we may not find an API for validating only the brokers without a specific topic. We could require a topic here for validation.
The user's Kafka cluster is deployed in a different VPC, and the client must connect to each partition's leader broker. If the address the client requests is wrong, the contacted broker returns the correct broker address, which is only valid within its own VPC. This requires that the source/sink can connect to all brokers and can rewrite broker addresses.
Yes, users can still create a source/sink without creating a connection first. If they provide both, the options in the `WITH` clause take precedence.
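For illustration, a sketch of that precedence rule, reusing the `FROM CONNECTION` syntax quoted earlier (the names and the overridden field are hypothetical):

```sql
-- demo_connection already carries properties.bootstrap.server;
-- the value given below takes precedence over the connection's value.
CREATE SOURCE demo_source (
    properties.bootstrap.server = 'ip3:port3'
)
FROM CONNECTION demo_connection;
```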
For Kafka, yes, we may support private link. For the others, no; we only treat the connection as a hashmap and use it to fill in missing fields.
I guess Patrick's idea is to add these […]
IMO, the RFC introduces a new concept […]
Connection should be reusable; private link is only one of its functions, and connection provides the basis for creating a source/sink with a secret key later. Currently the connection is not checked because we haven't found a suitable API for it; we may fix that in future designs. The core of the current effort is to support private links, and in the short term I support providing this capability via:

```sql
CREATE SOURCE/SINK {{ name }} ...
WITH (
    connector = 'kafka',
    properties.bootstrap.server = 'ip1:port1,ip2:port2',
    private.links = '[{"service_name": "xxx", "availability_zones": ["a", "b", "c"], "port": 8080}, {...}, {...}]',
    ...
);
```

The problem is that the CN (compute node) does not have the private link specifications, which requires an RPC back to meta.
Exactly. I am a little bit conservative about introducing a new SQL concept to kernel users, since now we need to teach them what a connection is. If we do think introducing […]
New connection syntax:

```sql
CREATE CONNECTION connection_name WITH (
    type = 'private_link',
    provider = 'aws',
    service.name = 'xxx',
    availability.zones = '[{"az": ["a", "b", "c"], "port": 8080}, {...}, {...}]'
);
```
After revisiting the concepts of AWS PrivateLink, I think the above syntax can be improved. I suggest putting the target ports in the `CREATE SOURCE` statement, making the mapping between the source broker address and the target AZ and port explicit:

```sql
CREATE CONNECTION connection_name WITH (
    type = 'private_link',
    provider = 'aws',
    service.name = 'xxx',
    availability.zones = '["az1", "az2", "az3"]'
);

CREATE SOURCE/SINK {{ name }} ...
WITH (
    connector = 'kafka',
    properties.bootstrap.server = 'ip1:port1,ip2:port2',
    privatelink.name = 'connection_name',
    privatelink.targets = '[{"az": "az1", "port": 9001}, {"az": "az2", "port": 9002}]',
    ...
);
```

For example, the traffic from broker […]
preview: https://github.com/risingwavelabs/rfcs/blob/tab/create-connection/rfcs/0035-create-connection.md