Terraform/Ansible demonstration of multi-tenant Lustre.
This repo creates and configures networks and nodes to represent the use of Lustre in a multi-tenant configuration:
```
   network 1            network 2            network 3
    (tcp1)               (tcp2)               (tcp3)
      |                    |                    |
      +-[lustre-storage]   |                    |
      +-[lustre-admin]     |                    |
      +-[lustre-client1]   +-[lustre-client2]   +-[lustre-client3]
      |                    |                    +-[lustre-comp3-0]
      |                    |                    |
      +---[lustre-lnet2]---+---[lustre-lnet3]---+
```
All networks are virtual but are intended to represent:
- tcp1: A low-latency storage network - nodes on this are considered "trusted" in some sense
- tcp2: Ethernet
- tcp3: A project-specific software-defined network - nodes on this are untrusted
As well as being a virtual network, each network above is also a Lustre network (lnet).
The nodes on the networks are then:
`lustre-storage`
: The Lustre server, acting as MGS, MDT and OST. This exports a single filesystem `test_fs1`. It is given a public IP and serves as a proxy for ssh access to the nodes.

`lustre-admin`
: A Lustre client used to administer the filesystem - it has a privileged view of the real owners/permissions.

`lustre-client[1-3]`
: Lustre clients with different access levels to the filesystem (discussed below).

`lustre-comp3-0`
: A Lustre client which can be configured as a Slurm compute node (discussed below).

`lustre-lnet[2-3]`
: Lnet routers to provide connectivity between clients and server across the different networks.
Various Lustre features are used to control how clients can access the filesystem:
- lnet routes: These allow lustre traffic to cross network types, but also define and hence control connectivity between clients and server.
- filesets: These restrict which subdirectories of the lustre filesystem clients can mount.
- nodemaps: These can be used to alter users' effective permissions, such as squashing root users to non-privileged users.
- shared keys: These can be used to prevent mounting of the filesystem unless client and server have appropriate keys, and/or to encrypt data in transit. Note this feature is not functional at present - see Known Issues.
In addition this repo provides two extra tools to help prevent misconfiguration:
`lnet-test.yml`
: runs `lnet ping` to check connectivity is present between clients and server, and is not present between clients in different lnets.

`verify.yml`
: exports lustre configuration to files on the control host, then automatically diffs them against a previous known-good configuration (if available).
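Both are ordinary playbooks; assuming they take the same inventory as `main.yml`, they can be run with:

```shell
ansible-playbook -i inventory lnet-test.yml
ansible-playbook -i inventory verify.yml
```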
For demonstration purposes, the `test_fs` lustre filesystem contains two root directories, `/csd3` and `/srcp`, and two "project directories":

`/csd3/proj12`
: mounted by `client1` and `client2`

`/srcp/proj3`
: mounted only by `client3`

Project directories are owned by a "project owner" user and group of the same name as the project, and given permissions `ug=rwx,+t`.
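In concrete terms, each project directory ends up looking roughly like this (an illustration of what the playbooks configure, assuming the whole filesystem is mounted at `/mnt/lustre` on an unrestricted client such as `lustre-admin`):

```shell
sudo mkdir -p /mnt/lustre/csd3/proj12
sudo chown proj12:proj12 /mnt/lustre/csd3/proj12
sudo chmod 1770 /mnt/lustre/csd3/proj12   # ug=rwx plus the sticky bit, i.e. drwxrwx--T
```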
The Lustre configurations applied to the clients model different access scenarios:
`client1`
: models a client on the CSD3 low-latency network which shares LDAP with the server and has full access to the filesystem (i.e. as controlled by normal Linux permissions), except that the client's root user is not privileged.

`client2`
: models a client with access to the same project, but which does not share LDAP and has restricted access, with the client's root user acting as the project owner.

`client3`
: models a client in an isolated SRDP project, which does not share LDAP and has restricted access, with a specific `datamanager` user acting as the project owner.
No LDAP service is actually provided here and all users are defined/configured by Ansible. The following demonstration users are set up to permit testing the above access control scenarios:
`client1`
: `andy` and `alex`

`client2`
: `becky` and `ben`

`client3`
: `catrin` and `charlie`
For details of how these aspects are configured, see Configuration below.
The below assumes deployment on `vss` from `ilab-gate`.
NB: There are some rough edges to this; see Known Issues if problems are encountered.
Download the 0.12.x release, unzip and install it:
export terraform_version="0.12.20"
wget https://releases.hashicorp.com/terraform/${terraform_version}/terraform_${terraform_version}_linux_amd64.zip
unzip terraform_${terraform_version}_linux_amd64.zip
# then put the terraform binary somewhere on your $PATH, e.g. (location is your choice):
sudo mv terraform /usr/local/bin/
From Horizon download a clouds.yaml file and put it in:
terraform/examples/slurm/clouds.yaml
Now use Terraform to create the infrastructure:
cd terraform/examples/slurm
terraform init
terraform plan
terraform apply
cd ansible
virtualenv .venv
. .venv/bin/activate
pip install -U pip
pip install -U -r requirements.txt
ansible-galaxy install -r requirements.yml
In the `ansible/` directory, with the virtualenv activated as above, run:
ansible-playbook -i inventory main.yml
Note that the inventory file is a symlink to the output of terraform.
Once this has completed, there will be Lustre configuration in `ansible/lustre-configs-live/`. To provide protection against misconfiguration, review these files
for correctness and then copy (and optionally commit) them to `ansible/lustre-configs-good/`. Ansible will then compare the live config against this each time it is run
and warn if there are differences.
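For example (a minimal sketch - the exact review/commit workflow is up to you):

```shell
# from the ansible/ directory:
diff -r lustre-configs-good/ lustre-configs-live/   # if a known-good version already exists
cp -r lustre-configs-live/. lustre-configs-good/
git add lustre-configs-good/                        # optionally commit the known-good config
```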
Optionally you may also set up the following:
Run:
ansible-playbook -i inventory monitoring.yml -e "grafana_password=<PASSWORD>"
where <PASSWORD>
should be replaced with a password of your choice.
This installs Prometheus (to collect monitoring data), Grafana (to query and display it) and exporters for both general node statistics (memory, cpu etc.) as well as HP's Lustre exporter.
The `lustre-storage` node (see `ssh_proxy` in `inventory` for its IP) then hosts Prometheus on port 9090 and Grafana (username "admin", password as chosen) on port 3000.
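As only `lustre-storage` has a public IP, one way to reach the dashboards is to tunnel the ports over ssh (a sketch - substitute the public IP and key you already use for ssh access):

```shell
ssh -L 3000:localhost:3000 -L 9090:localhost:9090 centos@<public_ip>
# then browse http://localhost:3000 (Grafana) and http://localhost:9090 (Prometheus)
```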
Run:
ansible-playbook -i inventory slurm.yml
This will:
- Create an OpenHPC Slurm cluster in net3, using `lustre-client3` as the combined control/login node and `lustre-comp3-0` as a single compute node.
- Configure Lustre to track Slurm jobs.
- Run a demo Slurm job which creates a 10GB file on the Lustre filesystem and then copies it.
This demo job should show up in the "jobstats" section of the Lustre Grafana dashboard. Note that jobstats are disabled on non-slurm clients, but the jobstats from slurm clients also appear to show processes which do not have a slurm job id.
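The raw jobstats can also be inspected directly on the server via the standard Lustre parameters (a quick manual check, independent of the monitoring stack):

```shell
[centos@lustre-storage ~]$ sudo lctl get_param mdt.*.job_stats
[centos@lustre-storage ~]$ sudo lctl get_param obdfilter.*.job_stats
```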
You can re-run the demo job, skipping the other steps for speed, using:
ansible-playbook -i inventory slurm.yml --tags demo
To ssh into nodes use:
ssh <ansible_ssh_common_args> centos@<private_ip>
where both `<ansible_ssh_common_args>` and the relevant `<private_ip>` are defined in `ansible/inventory`.
For routers, note the relevant IP is the one for the lower-numbered network it is connected to.
This section explains how the Lustre configuration described above is defined here.
This is defined by the groups in the inventory template at `terraform/modules/cluster/inventory.tpl`.
These are defined by `ansible/group_vars`:

`lnet_tcpX`
: These define the first interface on each node, and hence the lnets which exist. So all nodes (including routers) need to be added to the appropriate one of these groups.

`lnet_router_tcpX_to_tcpY`
: These define the 2nd interface for routers and also set the routing-enabled flag. So only routers need to be added to these groups. Note that the convention here is that `eth0` goes on the lower-numbered network, and that this is the side Ansible uses to configure router nodes.

`lnet_tcpX_from_tcpY`
: These define routes, so any nodes (clients, storage or routers) which need to access nodes on other networks need to be in one or more of these groups. In the `routes` dict in these groups there should be one entry, with the key defining the "end" network (matching the "X" in the filename) and a value defining the gateway. Note a dict is used, with `hash_behaviour = merge` set in `ansible/ansible.cfg`, so that nodes can be put in more than one routing group and will end up with multiple entries in their `routes` var. In the example here this is needed for the storage server, which requires routes to both `tcp2` and `tcp3`.
These groups are then used to generate a configuration file for each node, using the `ansible/lnet.conf.j2` template as part of the appropriate server/router/client role, which is then imported into Lustre.
Note that ansible will enforce that ONLY the automatically-defined routes are present.
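To sanity-check what was actually applied on a given node, the standard `lnetctl` query commands can be used (not something the playbooks require - just a convenient manual check):

```shell
sudo lnetctl net show      # which lnet(s) this node's interfaces (NIDs) are on
sudo lnetctl route show    # routes defined on this node
sudo lnetctl export        # full local lnet configuration, comparable with the generated lnet.conf
```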
Additional general information about how lnet routes work is provided under Lustre networks below.
As clients can only be in one nodemap, a nodemap is generated for each client group (e.g. `client_net1` etc.). Nodemap parameters are set using the `lustre` mapping, with default values (which match Lustre's own defaults) given in `ansible/group_vars/all` and overridden as required for specific client groups (e.g. in `ansible/group_vars/client_net1.yml`).
The key/value pairs in the `lustre` mapping function essentially as described in the Lustre nodemap documentation to provide maximum flexibility. In brief:

`trusted`
: determines whether client users can see the filesystem's canonical identifiers. Note these identifiers are uid/gid - what user/group names these resolve to (if at all) depends on the users/groups present on the server.

`admin`
: controls whether root is squashed. The user/group it is squashed to is defined by the `squash_uid` and `squash_gid` parameters.

`squash_uid` and `squash_gid`
: define which user/group unmapped client users/groups are squashed to on the server. Note that although the lustre documentation states squashing is disabled by default, in fact (under 2.12 and 2.14 at least) the squashed uid and gid default to 99 (the `nobody` user). Therefore if squashing is not required the `trusted` property must be set.

`deny_unknown`
: if set, prevents access by all users not defined in the nodemap.

`fileset`
: if set, restricts the client to mounting only this subdirectory of the Lustre filesystem[^1].

`idmaps`
: define specific users/groups to map; contains a list where each item is a 3-list of:
  - mapping type: 'uid' or 'gid'
  - client uid/gid to map to ...
  - ... uid/gid on server
This configuration follows Lustre's concepts/terminology very closely, although the use of Ansible makes it somewhat more user-friendly as for example uids can be looked up from usernames.
Note that ansible will enforce that no nodemaps exist other than the ones it defines and the default
nodemap, and that all parameters on those nodemaps match the ansible configuration, including client ranges and id maps.
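For orientation, these parameters map closely onto the `lctl nodemap_*` commands run on the MGS (in this repo the Ansible roles and `tools/lustre-tools/nodemap.py` apply the real configuration). A rough sketch, with illustrative names and values resembling the client1/client2 settings described below:

```shell
# Set nodemap properties (illustrative values):
sudo lctl nodemap_modify --name client_net1 --property trusted --value 1
sudo lctl nodemap_modify --name client_net1 --property admin --value 0
sudo lctl nodemap_set_fileset --name client_net1 --fileset '/csd3'
# An idmap entry maps a client uid/gid to a server uid/gid,
# e.g. client root (uid 0) to a hypothetical server uid 1001:
sudo lctl nodemap_add_idmap --name client_net2 --idtype uid --idmap 0:1001
```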
The demo users `andy` etc. are defined for each client individually in `group_vars/client_net*.yml:users`. These users are created on the appropriate clients by `users.yml`, which also creates the client1 users on the server to fake shared LDAP. While the lustre documentation specifically states that uids and gids are required to be the same "on all clients", this is not necessarily the case when clients are mounting isolated directories as here.
Note that `group_vars/all.yml` also defines `root` and `nobody` users - these are default OS users and are defined here purely to allow them to be referred to in the client nodemap setup. The combination of user mappings from the `all` and client group_vars files requires having `hash_behaviour = merge` in ansible's configuration (as does the lnet configuration described above).
The project owner and project member user/groups are defined by `group_vars/all.yml:projects` and are also created by `users.yml`. For simplicity, these are the same on all clients and the server although strictly client2 does not need the `proj12` user/group, etc.
Project directories are defined by `group_vars/all.yml:projects`. The `root` key is prepended to the project name to give the project's path in the lustre filesystem.
This section provides extended context and discussion of configuration and behaviour.
There are essentially 3 aspects to be configured:
- All nodes must have an interface (e.g. eth0) on an Lnet (e.g. tcp1): Note that an Lnet itself is only actually defined by the set of nodes with an interface on it - there is no "stand-alone" definition of an LNET.
- Routers also need an interface onto a 2nd Lnet, and the routing-enabled flag set on.
- Nodes which need to be able to reach nodes on other networks need routes to be defined. Note that this includes any routers which need to route messages to networks they are not directly connected to.
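In terms of plain `lnetctl` commands, which the generated per-node lnet configuration effectively automates, these three aspects look roughly as follows (interface names and NIDs are illustrative):

```shell
sudo lnetctl net add --net tcp2 --if eth0                      # 1. put an interface on an lnet
sudo lnetctl set routing 1                                     # 2. routers only: enable routing
sudo lnetctl route add --net tcp1 --gateway 192.168.2.1@tcp2   # 3. define a route to another lnet via a gateway NID
```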
A few aspects of routes may not be obvious:
- Routes need to be set up bi-directionally, and asymmetric routes are an advanced feature not recommended for normal use by the documentation.
- Routes are defined for a specific node, but to a whole network. This means that you can enable e.g. a client in net3 to reach storage in net1, without the reverse route enabling a client in net1 to access the client in net3 (because the reverse route is only defined for storage1).
- Routes are defined in terms of the "end" network and the gateway used to get there. The gateway is the router which provides the "closest" hop towards the end network.
Multi-hop paths require routes to be defined along the way: e.g. if node "A" in network 1 needs to go through networks 2 and 3 to reach node "B" in network 4 then:
- node "A" needs a route to 4 to be defined using the gateway router from 1-2.
- The router forming the 1-2 gateway needs a route to 4 to be defined using a gateway from 2-3.
- The router forming the 2-3 gateway needs a route to 4 to be defined using a gateway from 3-4.
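As a sketch in `lnetctl` terms (tcp4 and the gateway addresses are hypothetical - this demo only has tcp1-tcp3):

```shell
# on node "A" (network 1):
sudo lnetctl route add --net tcp4 --gateway <router-1-2-ip>@tcp1
# on the router forming the 1-2 gateway:
sudo lnetctl route add --net tcp4 --gateway <router-2-3-ip>@tcp2
# on the router forming the 2-3 gateway:
sudo lnetctl route add --net tcp4 --gateway <router-3-4-ip>@tcp3
```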
It is not necessarily obvious how to configure the nodemap functionality, project directory permissions and users/groups to give the desired access control. This section therefore provides narrative explanation of how the example configuration here actually works to provide the outcomes defined in Projects and Users. If experimenting with configuration note that:
- While the manual says nodemap changes propagate in ~10 seconds, it was found necessary to unmount and remount the filesystem to get changes to apply, although this was nearly instantaneous and proved robust.
- Reducing the caching of user/group upcalls from the default 20 minutes to 1 second is recommended using:

      [centos@lustre-storage ~]$ sudo lctl set_param mdt.*.identity_expire=1

- Whether modifying configuration using ansible or lustre commands, running the `verify.yml` playbook and reviewing `ansible/lustre-configs-live/lustre-storage-nodenet.conf` is a convenient way to check the actual lustre configuration.
Firstly, note that the actual lustre filesystem configuration (defined in `group_vars/all.yml:projects`) is as follows:

`/csd3/proj12`
: owner=`proj12`, group=`proj12`, mode=`drwxrwx--T`

`/srcp/proj3`
: owner=`proj3`, group=`proj3`, mode=`drwxrwx--T`
This can be seen from the `admin` client, which has both the `trusted` and `admin` properties set, so its users (including `root`) can see the real filesystem ids.
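For example, from the admin client (assuming the whole filesystem is mounted at `/mnt/lustre`) the real ownership and mode are visible, something like:

```shell
[centos@lustre-admin ~]$ ls -ld /mnt/lustre/csd3/proj12
drwxrwx--T. 2 proj12 proj12 4096 ... /mnt/lustre/csd3/proj12   # size/date omitted
```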
Secondly, note that `users.yml` ensures the "project owner" users/groups (e.g. `proj12`) and "project member" user/groups (e.g. `proj12-member`) are present on both the server and clients 1-3:
- For the server and client1, this models LDAP as mentioned above.
- For the other clients these users/groups would have to be configured in some other way on both the client and the server, but uid/gids could differ between client and server.
- Users/groups do not necessarily need to be present on clients which do not mount the associated project directory (e.g. client3 does not need `proj12` and `proj12-member`) - this is done here purely to simplify the logic and configuration. `admin.yml` also ensures these users/groups are present on the `admin` client; this is not a lustre requirement but is done here because ansible uses user/group names rather than uid/gids when creating the project directories.
The client configurations are then as follows:
`client1`:

- As `fileset=/csd3`, the client's `/mnt/lustre` provides access to any project directories within `/csd3`, e.g. `/mnt/lustre/proj12` -> `/csd3/proj12`, but prevents access to projects in `/srcp`.
Considering access to `/csd3/proj12/`:
- Because `trusted=true`, all client users see the true uid/gids in the filesystem, hence permissions generally function as if it were a local directory, given users/groups are present on both server and client.
- Client users `alex` and `andy` have a secondary group (on both server and client) of `proj12`, hence get group permissions in the directory.
- As `admin=false`, the `root` user is squashed to the default `squash_uid` and `squash_gid` of 99, i.e. user `nobody`, and therefore has no permissions in the directory.
- The client user `centos` (not defined by `users.yml` but present on both client and server as a default OS user) does not have the correct secondary group and hence cannot access the directory.
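These behaviours can be checked directly from `lustre-client1` (illustrative throwaway-file tests, assuming the mount at `/mnt/lustre`):

```shell
sudo -u andy touch /mnt/lustre/proj12/test-andy   # succeeds: andy has the proj12 secondary group
touch /mnt/lustre/proj12/test-centos              # fails: centos has no group access
sudo touch /mnt/lustre/proj12/test-root           # fails: root is squashed to nobody
```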
`client2`:

- As `fileset=/csd3/proj12`, the client's `/mnt/lustre` only provides access to this directory.
- Because `trusted=false`, ALL users must either be defined in the `idmap` or will be subject to user/group squashing.
- The client's root user is mapped to `proj12`, which gives it owner permissions in the project directory. It also means the project directory's owner appears as `root` (rather than the real `proj12`) to all client users. Note the root group is not mapped, as we want the project directory's group to appear as `proj12`.
- Users are squashed to `proj12-member` (i.e. a non-owning user) and groups to `proj12` (i.e. the directory's real group). Users therefore do not own the project directory but do match the directory's group.
- However, to actually get the group permissions, client users (e.g. `becky` and `ben`) must also be members of the `proj12` group (on the client). It is not clear why this is necessary, given the group squashing. It is not necessary for the server user they are squashed to (`proj12-member`) to be a member of the appropriate group (`proj12`) on the server.
- The default OS user `centos` is not a member of `proj12` and hence has no access to the directory.
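So from `lustre-client2` a listing of the mount looks something like the following (illustrative output - the owner shows as `root` because of the idmap, while the group shows as `proj12`):

```shell
[centos@lustre-client2 ~]$ ls -ld /mnt/lustre
drwxrwx--T. 2 root proj12 4096 ... /mnt/lustre   # size/date omitted
```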
`client3`:

- As `fileset=/srcp/proj3`, the client's `/mnt/lustre` only provides access to this directory.
- The nodemap and user configuration is exactly comparable to that for client2, except that the client user `datamanager` (instead of `root`) is mapped to the project owner `proj3`. Note this user only exists on the client.
- Behaviour for demo users `catrin` and `charlie` and the default OS user `centos` is exactly analogous to client2.
- As `root` is not idmapped it is squashed to user `proj3-member` and group `proj3`, as for all other users. However, unlike normal users it has group access without needing to have `proj3` as a secondary group.
NB: For mounting the filesystem to succeed, the mounting user must be able to access the mounted directory, so any squashing/mapping must be appropriate AND the appropriate users must be present on client and server. E.g. if mounting as `root` but this is mapped to another user, the users must be set up first. This is why mounting is done separately in `mount.yml` after `users.yml`, rather than as part of the `clients.yml` playbook.
When run, the Ansible will enforce that:

- only the lnet routes it defines are present
- no nodemaps exist other than the ones it defines and the `default` nodemap
- all parameters on those nodemaps (including client ranges and idmaps) match the Ansible configuration

In any of these cases it may be necessary to remove configuration manually using lustre commands. Note that the `verify.yml`
playbook can identify these issues if a known-good configuration is defined.
- If ssh keys change you will need to manually confirm the connection.
- If you see any of the below errors from Ansible, just rerun the Ansible command:
  - Authenticity of host cannot be established (may require accepting the fingerprint)
  - Timeout waiting for privilege escalation prompt
  - Failures during installation of Lustre client kmods (possibly this is hitting repo rate limiting?)
  - Failures of `lnet-test.yml` (possibly the server is not ready?) - obviously repeated failures are bad
  - When running `monitoring.yml`: "failure 1 during daemon-reload: Failed to execute operation: Interactive authentication required."
- Shared-key security (ssk) does not currently work due to:
  - A bug in how Lustre handles `sudo` for ssk.
  - Reverse DNS lookups (required for ssk) not working in the VSS environment as configured here - fixing this is tricky due to the (OS) network setup.

  Therefore at present `group_vars/all.yml:ssk_flavor` should be set to `'null'` to disable this.
- When running `projects.yml`, nodemap configuration parameters will always show as changed, even if they actually aren't. This is due to lustre CLI limitations.
Suggested routes for development are:
- Extend `tools/lustre-tools/lnet.py` to provide an `import` function similar to `tools/lustre-tools/nodemap.py`: currently lnet config is set by exporting the live config, deleting all of it (which raises some errors which we ignore) then importing the desired config. This means ansible a) reports errors and b) always shows config as changed.
- In an environment where the reverse DNS lookup works correctly (i.e. `nslookup <ip_addr>` returns a name), work around the ssk sudo bug (e.g. by logging in as root when mounting) and test ssk functionality/performance. (Note current `authorized_keys` entries for `root` prevent running commands.)
- Add/use an `eth0_address` variable for hosts in addition to `ansible_host` to protect against unusual cases of the latter.
- Use ganesha running on the tenant router to re-export the lustre filesystem over NFS for the tenant's clients. This would remove the need for the clients to be running lustre.
- Update the lustre Prometheus exporter to use https://github.com/HewlettPackard/lustre_exporter/pull/148 to provide OST data in 2.12 (note this PR currently doesn't compile).
[^1]: The lustre documentation for the Fileset Feature is confusing/incorrect, as it appears to conflate submounts - which involve the client specifying a path in the filesystem, and are hence voluntary - with filesets, where the client only specifies the filesystem to mount and the server only exports the subdirectory defined by the appropriate fileset. Submount functionality is not exposed by this code.