Install loops forever, no easy way to trace install progress of diskful nodes.

There's no easy way to trace the installation of a compute node in the xCAT log and determine why the node's chain properties fail to update correctly, which leaves the node in an infinite install loop. (Essentially, #boot needs to be set in the petitboot config file.)

While debugging, I came up with the following high-level trace of what is going on:

post install processing

The interesting script is post.xcat. Inside it:

Downloads postscripts
Executes mypostscript.post
At some point, runs nodeset <> next
This triggers petitboot.pm to run rsetboot <> default (this seems to have caused some failure, and the rest of the script did not execute; did it kill the monitor process?)
Then runs makedhcp
Then runs setstate()
Inside setstate(), it reads chain.currentstate and either writes #boot into the petitboot config file or generates the kickstart file again

If anything fails during this post-install processing, it seems we end up in an infinite install loop.
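One quick check for the loop described above is whether the node's petitboot config actually contains the #boot marker. The sketch below is a hypothetical helper, not part of xCAT: `has_boot_marker` is a made-up name, and it assumes the marker sits on a line by itself in the config file.

```shell
# has_boot_marker CONFIG_FILE   (hypothetical helper, not an xCAT command)
# Exit 0 if some line of the file is exactly "#boot",
# exit 1 if no such line exists, exit 2 if the file cannot be read.
has_boot_marker() {
    grep -qx '#boot' "$1"
}
```

Usage might look like `has_boot_marker /tftpboot/petitboot/mid05tor12cn05 && echo "#boot set" || echo "#boot missing"`. The path is an assumption about where petitboot.pm writes the per-node file; adjust it for your tftp directory.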

Debug install logs

Target node: mid05tor12cn05

Started the diskful install here:

====================================================
[Date]       2017-12-21 13:46:03
[ClientType] cli
[Request]    rpower mid05tor12cn05 reset
[Response]

Attaching the full cluster.log from provisioning:
mid05tor12cn05.cluster.log

If we run the following command: grep mid05tor12cn05 /var/log/xcat/cluster.log

the output doesn't contain lines like these, which never mention the node name:

[root@briggs01 xcat]# grep next mid05tor12cn05.cluster.log
Dec 21 13:59:09 briggs01 xcat[11497]: DEBUG xcatd: dispatch request 'nodeset next' to plugin 'petitboot'
Dec 21 13:59:09 briggs01 xcat[11497]: DEBUG xcatd: dispatch request 'runbeginpre next' to plugin 'prescripts'
Dec 21 13:59:09 briggs01 xcat[11497]: DEBUG xcatd: dispatch request 'setdestiny next' to plugin 'destiny'
Dec 21 13:59:10 briggs01 xcat[11497]: DEBUG xcatd: dispatch request 'runendpre next' to plugin 'prescripts'
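Since those dispatch lines carry no node name, one workaround is to filter cluster.log by time window instead of by hostname. The following is a minimal sketch, assuming syslog-style timestamps (`Mon DD HH:MM:SS`) as in the excerpt above and a window that stays within one day. `log_window` is a hypothetical name, not an xCAT tool.

```shell
# log_window FILE MONTH DAY FROM TO   (hypothetical helper)
# Print every line of FILE whose timestamp falls on MONTH DAY between
# FROM and TO (inclusive, HH:MM:SS), whether or not it names the node.
log_window() {
    awk -v mon="$2" -v day="$3" -v from="$4" -v to="$5" \
        '$1 == mon && $2 == day && $3 >= from && $3 <= to' "$1"
}
```

For this install, something like `log_window /var/log/xcat/cluster.log Dec 21 13:46:03 14:00:00` would start at the rpower reset shown earlier. The end time is only an illustrative guess. On a busy management node the window will also include traffic for other nodes, so this is a coarse filter at best.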

So how can we get a full picture of what is going on? And how do we see when problems happen, so that admins know how to resolve them?

Logs need to have consistent searchable keywords

[root@briggs01 xcat]# grep -i error mid05tor12cn05.cluster.log
[root@briggs01 xcat]# grep -i failed mid05tor12cn05.cluster.log
Dec 21 13:59:07 mid05tor12cn05 xcat:  failed to download precreated mypostscript, trying to generate with getpostscript.awk
[root@briggs01 xcat]#
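Until the logs use consistent keywords, the two greps above can be folded into one case-insensitive scan over several likely failure words. This is only a sketch: `scan_failures` is a hypothetical name, and the keyword list is my guess, not a set of strings xCAT is known to emit.

```shell
# scan_failures FILE   (hypothetical helper)
# Print lines, with line numbers, that contain any of a few failure-like
# keywords. The keyword list is an assumption; extend it as needed.
scan_failures() {
    grep -inE 'error|fail|warn|timeout|unable' "$1"
}
```

Running `scan_failures mid05tor12cn05.cluster.log` on the attached log should at least surface the "failed to download precreated mypostscript" line shown above.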

This is a success case. I don't know how we could inject errors along the way to see whether we could actually figure out why the chain table does not get updated.
