8000 mcelog plugin by krzysztofx · Pull Request #2003 · collectd/collectd · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

mcelog plugin #2003

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Dec 26, 2016
Merged

mcelog plugin #2003

merged 4 commits into from
Dec 26, 2016

Conversation

krzysztofx
Copy link
Contributor

Plugin for collecting machine check exceptions reported by mcelog
The plugin at current stage does the following:
• Checks mcelog server liveliness, reports a failure if it’s not running of if it fails.
• Retrieve Memory Corrected and Uncorrected Errors from mcelog Unix socket (client protocol)
• Publishes collected data as notifications and statistics(values)

@krzysztofx krzysztofx changed the title Feat ras mcelog plugin Oct 26, 2016
@rubenk
Copy link
Contributor
rubenk commented Oct 26, 2016

Thanks for your contribution @krzysztofx!
I'll leave some comments inline.

@@ -926,6 +931,11 @@ Prerequisites
`varnish' plugin.
<http://varnish-cache.org>

* mcelog
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think you need to add this block in this section, since it is mcelog is not a library and needed for building collectd.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right, it's not compilation prerequisite

- mbmon
Motherboard sensors: temperature, fan speed and voltage information,
using mbmon(1).

- mcelogmon
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The plugin is called mcelog

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll fix that

@@ -6314,6 +6314,7 @@ AC_PLUGIN([match_regex], [yes], [The regex match])
AC_PLUGIN([match_timediff], [yes], [The timediff match])
AC_PLUGIN([match_value], [yes], [The value match])
AC_PLUGIN([mbmon], [yes], [Query mbmond])
AC_PLUGIN([mcelog], [yes], [Machine Check Exceptions notifications])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This always builds and installs mcelog, but I think it is Linux-specific?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right, I'll fix that

mcelog_la_SOURCES = mcelog.c
mcelog_la_LDFLAGS = $(PLUGIN_LDFLAGS)
mcelog_la_LIBADD =
if BUILD_WITH_LIBSOCKET
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think you need this, it is mainly for Solaris iirc, and mcelog seems Linux-only.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

libsocket linkage removed

Copy link
Member
@octo octo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @krzysztofx,

thank you very much for your patch! I'm afraid the worker thread needs some more work, the indentation levels get out of hand and checking for else-blocks becomes impossible for many of the very long if statements. I also think that the function is too long and at least the parsing code should be moved to its own function.

Best regards,
—octo

typedef struct mcelog_config_s mcelog_config_t;
typedef struct mcelog_memory_rec_s mcelog_memory_rec_t;

mcelog_config_t g_mcelog_config;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The static keyword is missing here. Also, please initialize the struct where it's declared, e.g.:

static mcelog_config_t g_mcelog_config = {
  .socket_path = "/var/run/mcelog-client",
  .logfile = "/var/log/mcelog",
  .unix_sock = {
    .sun_family = AF_UNIX,
    .sun_path = "/var/run/mcelog-client",
  },
  .sock_fd = -1,
 };

Copy link
Contributor Author
@krzysztofx krzysztofx Nov 2, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to provide way to reinitalize g_mcelog_config struct, so how do you see usage of compound literals there ? I mean:

static mcelog_config_t g_mcelog_config ;
// skippded lines
static void mcelog_config_init_default(void) {
  g_mcelog_config = (mcelog_config_t){
    .logfile = "/var/log/mcelog",
    .unix_sock = {
      .sun_family = AF_UNIX,
      .sun_path = "/var/run/mcelog-client",
    },
    .tid = 0,
    .sock_fd = -1,
  };

#define POLL_TIMEOUT 1000 /* ms */

struct mcelog_config_s {
char socket_path[PATH_MAX]; /* mcelog client socket */
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is the path stored twice? Once here and once in unix_sock.sun_path.

CDTIME_T_TO_TIMEVAL(interval, &socket_timeout);

g_mcelog_config.unix_sock.sun_family = AF_UNIX;
g_mcelog_config.sock_fd = -1;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a dead assignment.

static const char corrected_err[] = "corrected memory errors:";
static const char uncorrected_err[] = "uncorrected memory errors:";

static void mcelog_config_init_default(void) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use a struct initializer, as mentioned above, to zero the struct. In particular, bzero() is not portable.

if (strcasecmp("McelogClientSocket", child->key) == 0) {
if (cf_util_get_string_buffer(child, g_mcelog_config.unix_sock.sun_path,
sizeof(g_mcelog_config.unix_sock.sun_path) -
1) < 0) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you subtract 1 here? cf_util_get_string_buffer() guarantees null termination.

DEBUG("%s: done", MCELOG_PLUGIN);
break;
}
if (!memcmp(buf, socket_str, strlen(socket_str))) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has a high risk of reading uninitialized memory from buf. Please use strncmp() instead.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

pthread_exit(NULL);
}
pthread_mutex_unlock(&mcelog_thread_lock);
int poll_ret = poll(&poll_fd, 1, POLL_TIMEOUT);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be worth sending this thread a signal (using pthread_kill() from the shutdown() callback) instead of busy looping here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean adding signal handler to thread function poll_worker ? Or just terminating thread asynchronously sending SIGTERM ? I have noticed that some plugins use that approach but I think this is not applicable here since we have to do some cleanup like calling fclose and so on.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will terminate thread with thread with pthread_cancel and put cleanup code into callback function set by pthread_cancel_push. Is it ok from your POV ?

mcelog_thread_shutdown = 1;
pthread_mutex_unlock(&mcelog_thread_lock);
pthread_join(g_mcelog_config.tid, NULL);
pthread_mutex_destroy(&mcelog_thread_lock);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please don't destroy mutexes that you initialized with the static initializer.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

pthread_mutex_unlock(&mcelog_thread_lock);
pthread_join(g_mcelog_config.tid, NULL);
pthread_mutex_destroy(&mcelog_thread_lock);
shutdown(g_mcelog_config.sock_fd, 0);
Copy link
Member

Choose a reason for F438 hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please check that sock_fd is in fact a valid file descriptor before calling shutdown().

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

static _Bool mcelog_thread_shutdown = 0;
static pthread_mutex_t mcelog_thread_lock = PTHREAD_MUTEX_INITIALIZER;

static const char socket_str[] = "SOCKET";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having these const char[] up here doesn't really improve the parsing code below. Please simply use string literals in the worker thread.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will change it

@krzysztofx
Copy link
Contributor Author

@octo @rubenk thanks for your reviews. I'll respond your comments soon.

@krzysztofx
Copy link
Contributor Author

Hi @octo @rubenk
I've pushed patch addressing your comments and also fix for build failure on epel5 setup.
Please let me know if you will find out anything else that could be improved.

and reported to software) reported by mcelog and generate appropriate
notifications when machine check errors are detected.
- mcelog
Logs and accounts machine check exceptions (in particular memory, IO, and
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I liked the original text better. The plugin doesn't actually log MCE's right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It logs only the amount of corrected/uncorrected memory related errors per memory SOCKET and CHANNEL. No MCE error codes are logged if that's what you mean. Data source is equivalent to output of 'mcelog --client', for example:

[root@silpixa00392165 kmatczax]# mcelog --client
Memory errors
SOCKET 0 CHANNEL any DIMM any
corrected memory errors:
        0 total
        0 in 24h
uncorrected memory errors:
        8 total
        8 in 24h

SOCKET 0 CHANNEL 0 DIMM any
corrected memory errors:
        0 total
        0 in 24h
uncorrected memory errors:
        8 total
        8 in 24h

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll revert to previous mcelog plugin description.

@krzysztofx
Copy link
Contributor Author

Hi

I've committed fix for typedef redefinition. All check passed but now I see merge conflict with no details given. Please let me know if I could do anything to fix that, or if any other changes in plugin code are still required.

@krzysztofx
Copy link
Contributor Author

@rubenk @octo
Would you mind If I rebase that PR against master branch to fix the conflict ?

@krzysztofx
Copy link
Contributor Author

Rebase done

@krzysztofx
Copy link
Contributor Author

Hi
@rubenk @octo
Sorry for pinging , but is there anything that could be done to get that plugin merged? If yes, please let me know.
I'm developing extension for that plugin that collects all MCE exceptions, so it would be easier to base on already merged version.

@rubenk
Copy link
Contributor
rubenk commented Dec 19, 2016

Could you re-order or squash some of your commits? You now have multiple commits with the same commit messages.

pkglib_LTLIBRARIES += mcelog.la
mcelog_la_SOURCES = mcelog.c
mcelog_la_LDFLAGS = $(PLUGIN_LDFLAGS)
mcelog_la_LIBADD =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not needed.

@@ -3322,6 +3322,28 @@ TCP-Port to connect to. Defaults to B<411>.

=back

=head2 Plugin C<mcelog>

The C<mcelog plugin> uses mcelog to retrieve machine check exceptions, etc.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does the 'etc.' stand for? It doesn't add much information, I'd either leave it out or extend the sentence.

mcelog server is running. When the server is running, the plugin will tail the
specified logfile to retrieve machine check exception information and send a
notification with the details from the logfile. The plugin will use the mcelog
client protocol to retrieve memory errors.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer 'retrieve machine check exceptions' instead of 'memory errors'.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At the moment mcelog's client/server protocol returns only memory related MCE's, see http://www.mcelog.org/protocol.html. That's why I used 'memory errors' term. Would you like to extend it to all MCE, or maybe use term 'memory related machine check exceptions' ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, thanks for the explanation. The latter sounds good.

=over 4

=item B<McelogClientSocket> I<Path>
Connect to mcelog client socket using the UNIX domain socket at I<Path>.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Connect to the mcelog client socket


=item B<McelogLogfile> I<Path>

mcelog file to parse. Defaults to B<"/var/log/mcelog">.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The mcelog logfile to parse.

@krzysztofx
Copy link
Contributor Author

@rubenk
Thanks for your review. I'll squash redundant commits and address your comments soon

Thanks
Krzysztof


/* synchronization via write lock since sock_fd may be changed here */
pthread_rwlock_wrlock(&self->lock);
self->sock_fd = socket(PF_UNIX, SOCK_STREAM, 0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you use SOCK_CLOEXEC here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, and while you're at it, add SOCK_NONBLOCK to the flags, so you don't have to do the fcntl later.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

regarding SOCK_NONBLOCK please see d212c7f. It failed on epel5

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah right, sorry for the confusion. The situation has changed now though, after we released 5.7 we dropped support for RHEL5 from master. Jenkins is still doing build on RHEL5 but those will be cleaned up shortly and you can ignore them.


/* Set socket timeout option */
if (setsockopt(self->sock_fd, SOL_SOCKET, SO_SNDTIMEO,
(char *)&socket_timeout, sizeof(socket_timeout)) < 0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the (char *) cast is unneeded.

static void *poll_worker(__attribute__((unused)) void *arg) {
char errbuf[MCELOG_BUFF_SIZE];
mcelog_thread_running = 1;
FILE **pp_file = calloc(1, sizeof(FILE *));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use sizeof on the var, not the type, so sizeof(*pp_file)

continue;

mcelog_memory_rec_t memory_record;
memset(&memory_record, 0, sizeof(memory_record));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use designated initialiser here, so mcelog_memory_rec_t memory_record = {0}

@krzysztofx
Copy link
Contributor Author
krzysztofx commented Dec 23, 2016

My apologies for squashing by mistake all previous commits into c8f0822. I hope it doesn't make it much harder to review. If you think I could somehow revert that rebase and squash only specific commits please let me know.
All recent comments have been addressed in 6016e1d

Now I see merge conflict with the master branch. Do you agree for rebase ?

Thanks
Krzysztof

@krzysztofx
Copy link
Contributor Author

pull-requests-github_trigger-aggregation failed due to:

mcelog.c:437:5: error: missing braces around initializer [-Werror=missing-braces]
    mcelog_memory_rec_t memory_record = {0};

Looks like this job has more restrictive compiler settings than others . Should I put additional braces like {{0}} or ignore that fail ?

@rubenk
Copy link
Contributor
rubenk commented Dec 23, 2016

No, that's not it, the compiler settings are the same. This job is the first that runs and the other ones are only run if this one succeeds.

This looks like https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53119.

Perhaps you could try changing the order of the members of the mcelog_memory_rec_s struct around to work around it? My guess is that this won't happen if the first member is an int instead of char[]

@krzysztofx
Copy link
Contributor Author

@rubenk
Changing order of struct members helped for missing braces warning. Now as expected I got failures on epel5 builds due to unknown socket flags:

mcelog.c: In function 'socket_reinit':
mcelog.c:163: error: 'SOCK_CLOEXEC' undeclared (first use in this function)
mcelog.c:163: error: (Each undeclared identifier is reported only once
mcelog.c:163: error: for each function it appears in.)
mcelog.c:163: error: 'SOCK_NONBLOCK' undeclared (first use in this function)
make[3]: *** [mcelog.lo] Error 1

So according to previous comments this is acceptable, right ?
Please let me know if I should rebase to master.

thanks
Krzysztof

@rubenk
Copy link
Contributor
rubenk commented Dec 25, 2016

Thanks @krzysztofx, yes, these errors on EPEL5 are perfectly alright.
Please rebase and then I'll merge.

Maryam Tahhan and others added 4 commits December 25, 2016 20:23
Signed-off-by: Krzysztof Matczak <krzysztofx.matczak@intel.com>
Change-Id: I0ddfd6dcd911ab46fcbc032081a2980b1f1f549b
Signed-off-by: Krzysztof Matczak <krzysztofx.matczak@intel.com>
Change-Id: I77a8ca1d5854ebda8533483cdd5ea08c3c07a059
Signed-off-by: Krzysztof Matczak <krzysztofx.matczak@intel.com>
Change-Id: I42b4b392da942808d29b73a13cd5e0bbcef9b6b8
Signed-off-by: Krzysztof Matczak <krzysztofx.matczak@intel.com>
@krzysztofx
Copy link
Contributor Author

@rubenk
Rebase done

Thanks
Krzysztof

@rubenk rubenk merged commit 38dd976 into collectd:master Dec 26, 2016
@vmytnykx vmytnykx mentioned this pull request Dec 26, 2016
@krzysztofx krzysztofx deleted the feat_ras branch February 17, 2017 09:59
@octo octo added this to the 5.8 milestone Oct 11, 2017
@maryamtahhan maryamtahhan mentioned this pull request Oct 12, 2017
19 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants
0