Install loops forever, no easy way to trace install progress of diskful nodes. · Issue #4582 · xcat2/xcat-core · GitHub
Install loops forever, no easy way to trace install progress of diskful nodes. #4582
Closed
@whowutwut

Description

There is no easy way to trace the install progress of a compute node in the xCAT log, so it is hard to determine why a node fails to get its chain properties updated correctly and ends up in an infinite install loop. (Essentially, #boot needs to be set in the petitboot config file.)
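As a first check on a looping node, an admin could test whether the node's petitboot config file already carries the #boot marker. The helper below is a minimal sketch: `has_boot_marker` is a hypothetical name, the config file path has to be supplied by the caller, and it assumes the marker sits at the start of a line.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report whether a petitboot config file carries the
# "#boot" marker. The caller passes the file path; no xCAT path is assumed.
# Assumption: the marker appears at the start of a line.
has_boot_marker() {
    local cfg="$1"
    [ -r "$cfg" ] || return 2      # file missing or unreadable
    grep -q '^#boot' "$cfg"        # exit 0 if the marker is present, 1 if not
}
```

A return code of 1 on a node that has already finished its OS install would suggest that the chain state was never advanced.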

While debugging, I came up with the following high-level trace of what is going on:

Post-install processing

The interesting script is post.xcat. Inside it:

  1. Downloads the postscripts.
  2. Executes mypostscript.post.
  3. At some point, runs nodeset <> next.
  4. This triggers petitboot.pm to run rsetboot <> default. (This seems to cause some failure, which resulted in the rest of the script not executing. Did it kill the monitor process?)
  5. Then runs makedhcp.
  6. Then runs setstate().
  7. Inside setstate(), it takes chain.currentstate and either writes #boot into the petitboot config file or generates the kickstart file again.

If anything fails during this post-install processing, the node appears to end up in an infinite install loop.
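One way to make a failing step visible would be to wrap each step so that its start and outcome are recorded under a fixed, greppable prefix. This is only a sketch of the idea, not how post.xcat works today: `run_step`, `STEP_LOG`, and the `xcat-install` prefix are all hypothetical names.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: run one post-install step and record its outcome
# under a fixed prefix. STEP_LOG and "xcat-install" are illustrative names,
# not existing xCAT settings.
STEP_LOG="${STEP_LOG:-/tmp/xcat-install-steps.log}"

run_step() {
    local name="$1"; shift
    local rc=0
    printf 'xcat-install step=%s status=start\n' "$name" >> "$STEP_LOG"
    "$@" || rc=$?                  # run the step, capture its exit code
    if [ "$rc" -eq 0 ]; then
        printf 'xcat-install step=%s status=ok\n' "$name" >> "$STEP_LOG"
    else
        printf 'xcat-install step=%s status=failed rc=%s\n' "$name" "$rc" >> "$STEP_LOG"
    fi
    return "$rc"                   # pass the step's exit code to the caller
}
```

With every step wrapped this way, `grep 'status=failed' "$STEP_LOG"` would show the first step that broke the chain, and a `status=start` line with no matching outcome would point at a step that never returned.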

Debug install logs

Target node: mid05tor12cn05

The diskful install started here:

====================================================
[Date]       2017-12-21 13:46:03
[ClientType] cli
[Request]    rpower mid05tor12cn05 reset
[Response]

Attaching the full cluster.log from provisioning:
mid05tor12cn05.cluster.log

If we run the following command: grep mid05tor12cn05 /var/log/xcat/cluster.log

the output does not contain lines like:

[root@briggs01 xcat]# grep next mid05tor12cn05.cluster.log
Dec 21 13:59:09 briggs01 xcat[11497]: DEBUG xcatd: dispatch request 'nodeset next' to plugin 'petitboot'
Dec 21 13:59:09 briggs01 xcat[11497]: DEBUG xcatd: dispatch request 'runbeginpre next' to plugin 'prescripts'
Dec 21 13:59:09 briggs01 xcat[11497]: DEBUG xcatd: dispatch request 'setdestiny next' to plugin 'destiny'
Dec 21 13:59:10 briggs01 xcat[11497]: DEBUG xcatd: dispatch request 'runendpre next' to plugin 'prescripts'

So how can we get a full picture of what is going on? How can we see when problems happen, so that admins know how to resolve them?
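A partial workaround, until the log lines carry the node name, is to filter by time window instead of by node. The sketch below assumes the "Mon DD HH:MM:SS host ..." layout seen in the excerpts; `log_window` is a hypothetical helper, and the admin has to choose the end of the window.

```shell
#!/usr/bin/env bash
# Sketch: print every syslog-style line that falls inside a time window on
# one day, whether or not the line mentions the node name.
# Assumes "Mon DD HH:MM:SS host ..." lines with zero-padded times.
log_window() {
    local file="$1" mon="$2" day="$3" start="$4" end="$5"
    awk -v mon="$mon" -v day="$day" -v s="$start" -v e="$end" '
        # ($3 "") forces a string comparison of the HH:MM:SS field
        $1 == mon && $2 == day && ($3 "") >= s && ($3 "") <= e
    ' "$file"
}
```

For the install above, `log_window /var/log/xcat/cluster.log Dec 21 13:46:03 14:00:00` would start at the rpower reset timestamp; the 14:00:00 end time is an arbitrary choice. On a busy management node the window will also include lines for other nodes.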

Logs need to have consistent searchable keywords

[root@briggs01 xcat]# grep -i error mid05tor12cn05.cluster.log
[root@briggs01 xcat]# grep -i failed mid05tor12cn05.cluster.log
Dec 21 13:59:07 mid05tor12cn05 xcat:  failed to download precreated mypostscript, trying to generate with getpostscript.awk
[root@briggs01 xcat]#

This is a success case. I don't know how we could inject errors along the way to see whether we could actually figure out why the chain table does not get updated.
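To see at a glance which keywords a log actually contains, the two greps above can be generalized into a per-keyword count. This is a minimal sketch: `keyword_counts` is a hypothetical helper, and the default keyword list is an assumption rather than an xCAT convention.

```shell
#!/usr/bin/env bash
# Sketch: count case-insensitive hits for a list of keywords in one log file.
# The default keyword list is an illustrative assumption.
keyword_counts() {
    local file="$1"; shift
    local kw
    [ "$#" -gt 0 ] || set -- error failed warning timeout
    for kw in "$@"; do
        # grep -c prints 0 and exits 1 when nothing matches; keep the count
        printf '%s %s\n' "$kw" "$(grep -ci -- "$kw" "$file" || true)"
    done
}
```

Running `keyword_counts mid05tor12cn05.cluster.log error failed` on the attached log would be expected to report zero lines for "error" and one for "failed", matching the grep output above.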
