[DRAFT] Cluster reboot tool by zveinn · Pull Request #20967 · minio/minio · GitHub

[DRAFT] Cluster reboot tool #20967


Closed
zveinn wants to merge 12 commits from the cluster-reboot-helper branch

Conversation

zveinn
Contributor
@zveinn zveinn commented Feb 21, 2025

This is a helper tool that lets clients reboot MinIO clusters, one server per erasure set at a time.

@zveinn zveinn force-pushed the cluster-reboot-helper branch from 586acc0 to 6649f16 on February 21, 2025 09:43
Contributor
@klauspost klauspost left a comment

So... You generate the host files, and then run each round manually, correct? Then run a health check before starting each round?

Each round reboots max one server per set.

Gives a good amount of visibility, and close enough to optimal to be fine. Certainly an improvement over clushing.


func makeHostfile() {
pools, totalServers, err := getInfra()
var rebootRounds [200][200]map[string]*Server
Contributor

Suggested change
var rebootRounds [200][200]map[string]*Server
const maxRounds = 200
const maxPools = 200
var rebootRounds [maxRounds][maxPools]map[string]*Server

That is correct, right?

Contributor Author

indeed

Comment on lines 297 to 310
if !ok {

	for _, rv := range rebootRounds[i][pid] {
		if haveMatchingSets(rv, s) {
			continue nextServer
		}
	}

	rebootRounds[i][pid][s.Endpoint] = pools[pid].Servers[sid]
	pools[pid].Servers[sid].Processed = true
	processed++
} else {
	continue
}
Contributor

Suggested change
if !ok {
	for _, rv := range rebootRounds[i][pid] {
		if haveMatchingSets(rv, s) {
			continue nextServer
		}
	}
	rebootRounds[i][pid][s.Endpoint] = pools[pid].Servers[sid]
	pools[pid].Servers[sid].Processed = true
	processed++
} else {
	continue
}
if ok {
	continue
}
for _, rv := range rebootRounds[i][pid] {
	if haveMatchingSets(rv, s) {
		continue nextServer
	}
}
rebootRounds[i][pid][s.Endpoint] = pools[pid].Servers[sid]
pools[pid].Servers[sid].Processed = true
processed++

}
for _, rv3 := range rv2 {
// _, err = roundFile.WriteString(rv3.Endpoint + "\n")
_, err = roundFile.WriteString(strings.Replace(rv3.Endpoint, ":443", "", -1) + "\n")
Contributor

Same for :80 ---- or any port???

Seems like you don't expect port to be on the host name in the file, correct?

Contributor Author

needs a split instead
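
For illustration, a minimal sketch (not part of the PR) of what that split could look like, using net.SplitHostPort so any trailing port is dropped rather than only ":443"; stripPort and the host names are hypothetical:

package main

import (
	"fmt"
	"net"
)

// stripPort drops a trailing ":port" from an endpoint, whatever the port is.
func stripPort(endpoint string) string {
	host, _, err := net.SplitHostPort(endpoint)
	if err != nil {
		// Not in host:port form (no port present); keep the endpoint unchanged.
		return endpoint
	}
	return host
}

func main() {
	fmt.Println(stripPort("node1.example.com:443"))  // node1.example.com
	fmt.Println(stripPort("node2.example.com:9000")) // node2.example.com
	fmt.Println(stripPort("node3.example.com"))      // node3.example.com
}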

if err != nil {
panic(err)
}
hostsList := bytes.Split(hosts, []byte{10})
Contributor

Just for ease of reading...

Suggested change
hostsList := bytes.Split(hosts, []byte{10})
hostsList := bytes.Split(hosts, []byte{'\n'})

if len(v) < 1 {
continue
}
rebootServer(string(v))
Contributor

I kinda expected these to all be concurrent. These are split into "safe" sets to do that... But maybe that should just be an opt-in.

Contributor Author

Yeah, before you use this command you'd use the "hostfile" command to generate a hostfile which contains a list of hosts that do not share erasure sets.
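
For illustration, a minimal sketch (not part of the PR) of the opt-in concurrency being discussed: rebooting every host of one generated round at once, assuming the round's hostfile already contains only hosts that share no erasure set. rebootRound and the host names are hypothetical; rebootServer stands in for the PR's SSH-based reboot call.

package main

import (
	"fmt"
	"sync"
)

// rebootServer is a stand-in for the PR's SSH-based reboot.
func rebootServer(host string) error {
	fmt.Println("rebooting", host)
	return nil
}

// rebootRound reboots all hosts of one round concurrently. Hosts within a round
// never share an erasure set, which is what makes the parallelism "safe".
func rebootRound(hosts []string) {
	var wg sync.WaitGroup
	for _, h := range hosts {
		if len(h) < 1 {
			continue
		}
		wg.Add(1)
		go func(host string) {
			defer wg.Done()
			if err := rebootServer(host); err != nil {
				fmt.Println("reboot failed:", host, err)
			}
		}(h)
	}
	wg.Wait()
}

func main() {
	rebootRound([]string{"node1.example.com", "node2.example.com"})
}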

config := &ssh.ClientConfig{
	User:            "root",
	HostKeyCallback: ssh.InsecureIgnoreHostKey(),
	Timeout:         10 * time.Second,
Contributor

So this is a dial timeout for establishing the connection.

You probably want some "operation" timeout as well, so you don't risk that hanging forever. Not sure how to do that on the session, though.
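
For illustration, one way such an operation timeout could be layered on top, a sketch assuming golang.org/x/crypto/ssh: run CombinedOutput in a goroutine and close the session when a deadline passes. runWithTimeout is a hypothetical helper; auth methods are omitted for brevity.

package main

import (
	"fmt"
	"time"

	"golang.org/x/crypto/ssh"
)

// runWithTimeout runs cmd on an established SSH client and gives up after d.
func runWithTimeout(client *ssh.Client, cmd string, d time.Duration) ([]byte, error) {
	session, err := client.NewSession()
	if err != nil {
		return nil, err
	}
	defer session.Close()

	type result struct {
		out []byte
		err error
	}
	done := make(chan result, 1)
	go func() {
		out, err := session.CombinedOutput(cmd)
		done <- result{out, err}
	}()

	select {
	case r := <-done:
		return r.out, r.err
	case <-time.After(d):
		// Closing the session unblocks CombinedOutput in most cases.
		session.Close()
		return nil, fmt.Errorf("%q timed out after %s", cmd, d)
	}
}

func main() {
	config := &ssh.ClientConfig{
		User:            "root",
		HostKeyCallback: ssh.InsecureIgnoreHostKey(),
		Timeout:         10 * time.Second, // dial timeout, as in the snippet above
		// Auth methods omitted for brevity.
	}
	client, err := ssh.Dial("tcp", "node1.example.com:22", config)
	if err != nil {
		panic(err)
	}
	defer client.Close()

	out, err := runWithTimeout(client, "echo hello", 30*time.Second)
	fmt.Println(string(out), err)
}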

}
defer session.Close()

output, err := session.CombinedOutput("echo 'this is where we execute the command'")
Contributor

Yeah, prolly want command as an argument.

Contributor Author

We have a client testing this out and we don't want to run the actual commands yet :)

Contributor

Completely agree. But I was thinking of making it a flag, with this as the default. They can then replace it with their actual restart command once they are happy.
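
For illustration, a sketch (not part of the PR) of such a flag, keeping the harmless echo as the default so the dry-run behaviour stays until operators swap in their real restart command; the flag name is hypothetical:

package main

import (
	"flag"
	"fmt"
)

func main() {
	// The default keeps the placeholder from the PR; operators can override it, e.g.
	//   -command "systemctl restart minio"
	cmd := flag.String("command", "echo 'this is where we execute the command'", "command to run on each host over SSH")
	flag.Parse()
	fmt.Println("would run:", *cmd)
}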

}
}

func haveMatchingSets(s1 *Server, s2 *Server) (yes bool) {
Contributor

Suggested change
func haveMatchingSets(s1 *Server, s2 *Server) (yes bool) {
// haveMatchingSets returns if s1 and s2 share any sets.
// Servers are assumed to be in same pool.
func haveMatchingSets(s1 *Server, s2 *Server) (yes bool) {
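
For illustration, a sketch of the overlap check that comment documents, assuming each Server records the indices of the erasure sets it belongs to in a Sets field (the field name and struct shape are assumed here, not taken from the PR):

package main

import "fmt"

// Server is a reduced, assumed shape; the PR defines the real struct elsewhere.
type Server struct {
	Endpoint  string
	Processed bool
	Sets      []int // erasure-set indices this server participates in
}

// haveMatchingSets returns true if s1 and s2 share any erasure set.
// Servers are assumed to be in the same pool.
func haveMatchingSets(s1 *Server, s2 *Server) (yes bool) {
	seen := make(map[int]struct{}, len(s1.Sets))
	for _, set := range s1.Sets {
		seen[set] = struct{}{}
	}
	for _, set := range s2.Sets {
		if _, ok := seen[set]; ok {
			return true
		}
	}
	return false
}

func main() {
	a := &Server{Endpoint: "node1.example.com", Sets: []int{0, 1}}
	b := &Server{Endpoint: "node2.example.com", Sets: []int{1, 2}}
	fmt.Println(haveMatchingSets(a, b)) // true
}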

stale bot commented Jul 19, 2025

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 15 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Jul 19, 2025