Description
There is no easy way to trace a compute node install in the xCAT log and determine why the node's chain properties fail to update correctly, which leaves the node in an infinite install loop. (Essentially, #boot needs to be set in the petitboot config file.)
While debugging, I came up with the following high-level trace of what is going on:
post install processing
The interesting script is post.xcat. Inside it:
- downloads postscripts
- executes mypostscript.post
- At some point, we run nodeset <> next
- This triggers petitboot.pm to run rsetboot <> default (this seems to have caused a failure that kept the rest of the script from executing; did it kill the monitor process?)
- Then runs makedhcp
- Then runs setstate()
- Inside setstate(), it takes chain.currentstate and either writes #boot into the petitboot config file or generates the kickstart file again
If anything fails during this post processing, it seems we end up in an infinite install loop.
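To narrow down which of the steps above was reached, one option is to look for the dispatch messages that xcatd logs when nodeset <> next runs. The following is only a sketch: the function name first_missing_step is made up for illustration, the four message strings are taken from the cluster.log excerpt in this report, and it only checks that each message is present, not their order. Since those messages do not carry the node name, this only makes sense on a log that covers a single install.

```shell
#!/bin/sh
# Hypothetical helper (not part of xCAT): print the first "nodeset next"
# dispatch step that does NOT appear in the given log file.
# Returns 0 if all four steps were logged, 1 otherwise.
first_missing_step() {
    log=$1
    for step in 'nodeset next' 'runbeginpre next' 'setdestiny next' 'runendpre next'; do
        # -F: fixed string match, -q: quiet, only the exit status matters
        if ! grep -qF -- "dispatch request '$step'" "$log"; then
            echo "$step"
            return 1
        fi
    done
    return 0
}
```

Example: first_missing_step mid05tor12cn05.cluster.log prints nothing when all four dispatch lines are present, and prints the first missing step otherwise.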
Debug install logs
Target node: mid05tor12cn05
Started the diskful install here:
====================================================
[Date] 2017-12-21 13:46:03
[ClientType] cli
[Request] rpower mid05tor12cn05 reset
[Response]
Attaching the full cluster.log from provisioning: mid05tor12cn05.cluster.log
Running grep mid05tor12cn05 /var/log/xcat/cluster.log does not return lines like these:
[root@briggs01 xcat]# grep next mid05tor12cn05.cluster.log
Dec 21 13:59:09 briggs01 xcat[11497]: DEBUG xcatd: dispatch request 'nodeset next' to plugin 'petitboot'
Dec 21 13:59:09 briggs01 xcat[11497]: DEBUG xcatd: dispatch request 'runbeginpre next' to plugin 'prescripts'
Dec 21 13:59:09 briggs01 xcat[11497]: DEBUG xcatd: dispatch request 'setdestiny next' to plugin 'destiny'
Dec 21 13:59:10 briggs01 xcat[11497]: DEBUG xcatd: dispatch request 'runendpre next' to plugin 'prescripts'
So how can we get a full picture of what is going on, and how do we surface problems when they happen so that admins know how to resolve them?
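As a stopgap, a combined view can be approximated by keeping every line in the install time window that either mentions the node or is an xcatd dispatch message. This is a rough sketch under several assumptions: node_trace is a hypothetical helper, the log uses the classic syslog layout shown above (time of day in the third field), and the whole install happens within one day so that HH:MM:SS values can be compared as strings. On a busy management node it will also pick up dispatch lines for other nodes.

```shell
#!/bin/sh
# Hypothetical helper (not part of xCAT):
#   node_trace NODE START END LOGFILE
# START and END are HH:MM:SS on the same day.
node_trace() {
    node=$1; start=$2; end=$3; log=$4
    awk -v node="$node" -v start="$start" -v end="$end" '
        # Field 3 is the HH:MM:SS timestamp; skip lines outside the window
        $3 < start || $3 > end { next }
        # Keep lines that name the node, plus xcatd dispatch lines,
        # which do not carry the node name
        index($0, node) || /xcatd: dispatch request/ { print }
    ' "$log"
}
```

Example: node_trace mid05tor12cn05 13:46:03 14:00:00 /var/log/xcat/cluster.log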
Logs need to have consistent searchable keywords
[root@briggs01 xcat]# grep -i error mid05tor12cn05.cluster.log
[root@briggs01 xcat]# grep -i failed mid05tor12cn05.cluster.log
Dec 21 13:59:07 mid05tor12cn05 xcat: failed to download precreated mypostscript, trying to generate with getpostscript.awk
[root@briggs01 xcat]#
This is a success case. I don't know how we could inject errors along the way to see whether we could actually figure out why the chain table does not get updated.
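To make the keyword check repeatable across nodes and logs, the two grep calls above could be wrapped in a small loop. This is just a sketch: keyword_counts is a hypothetical helper, and the keywords are whatever the caller passes in, since the point of this issue is that xCAT does not yet guarantee a consistent set.

```shell
#!/bin/sh
# Hypothetical helper (not part of xCAT):
#   keyword_counts LOGFILE KEYWORD...
# Prints one "keyword<TAB>count" line per keyword (case-insensitive,
# counting matching lines).
keyword_counts() {
    log=$1; shift
    for kw in "$@"; do
        # grep -c prints 0 and exits non-zero when nothing matches,
        # so the exit status is ignored on purpose
        n=$(grep -ci -- "$kw" "$log" || true)
        printf '%s\t%s\n' "$kw" "$n"
    done
}
```

Example: keyword_counts mid05tor12cn05.cluster.log error failed. Going by the grep transcript above, that should report 0 lines for error and 1 for failed on the attached log.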