[RFC] Log forwarding code in nginx container is brittle and may lead to process stalls · Issue #3630 · Mailu/Mailu · GitHub

[RFC] Log forwarding code in nginx container is brittle and may lead to process stalls #3630


Open
koraa opened this issue Nov 1, 2024 · 6 comments
Labels
type/enhancement Enhances existing functionality

Comments

@koraa
koraa commented Nov 1, 2024

Hi all,

as per request in #3627, I am moving the discussion here.

Below are the key points from the other thread. Since the post copied below was written, @nextgens kindly pointed out that the stream forwarding functionality is required for LogFilter to work. For this reason, removing the entire stream forwarding functionality is not a suitable fix.

The remaining possibilities are:

  • Adding exception handlers to the log forwarding threads
  • Migrating away from the threaded log forwarder by:
    • Either using asyncio in the runner script (see the sketch after this list),
    • Or using supervisord to manage subprocesses and moving the log filter into a dedicated filter process (e.g. a stand-alone wrapper script).
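
To make the asyncio option concrete, here is a minimal sketch of how an asyncio-based runner could forward (and optionally filter) a child process's output without extra threads. The function name, command and filtering hook are illustrative, not existing Mailu code:

import asyncio
import sys

async def run_and_forward(cmd):
    # Spawn the child with its stdout/stderr as a pipe we can read asynchronously.
    proc = await asyncio.create_subprocess_exec(
        *cmd,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.STDOUT,
    )
    # asyncio.StreamReader is an async iterator over lines (Python 3.8+).
    async for line in proc.stdout:
        # A LogFilter hook could drop or rewrite lines here.
        sys.stdout.buffer.write(line)
        sys.stdout.buffer.flush()
    return await proc.wait()

# e.g. asyncio.run(run_and_forward(['/usr/sbin/nginx', '-g', 'daemon off;']))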

Would a PR adding exception handling to the forwarder thread be desirable?


I gave the underlying Popen code a thorough review. What I found is that it is surprisingly hard to use this API correctly.

I do think forward_text_lines still lacks error handling.

https://github.com/Mailu/Mailu/blob/aa52bbb731c374a707df41588a108cedaca71fc4/core/base/libs/socrate/socrate/system.py#L163C1-L166C32

Using Popen without text=True, as was done in the patch, does help. Popen's init always opens the pipe ends with io.open(<fd>, 'rb'/'wb', <bufsize>) and only wraps those streams in io.TextIOWrapper when text mode is requested. So without the flag, it returns plain binary streams backed directly by the file descriptors.

https://github.com/python/cpython/blob/3275cb19530fb5c7115cf8313f1ada9621ed3a92/Lib/subprocess.py#L1018C13-L1033C62
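
For illustration, a quick way to see the difference (the exact stream types are CPython implementation details, shown here for Python 3):

import subprocess

# Without text=True the pipe is a binary buffered stream...
p = subprocess.Popen(['echo', 'hi'], stdout=subprocess.PIPE)
print(type(p.stdout))   # <class '_io.BufferedReader'>
p.wait()

# ...with text=True it is wrapped in a TextIOWrapper.
p = subprocess.Popen(['echo', 'hi'], stdout=subprocess.PIPE, text=True)
print(type(p.stdout))   # <class '_io.TextIOWrapper'>
p.wait()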

So the functions actually being used are the binary readline/write implementations of those buffered streams.

One of the biggest potential sources of an exception would be the EINTR system error – i.e. the error raised when a unix signal is delivered to a process while a system call is in progress. There is some code in the above functions handling this case; there is also a lot of other very complicated error handling and a lot of NULL propagation in that code, and I would not trust it to never raise exceptions.

A relatively easy fix would be to add exception handling to forward_text_lines and deliberately crash the entire process on an error:

import logging
import sys
import traceback

def forward_text_lines(src, dst):
    try:
        while True:
            current_line = src.readline()
            if not current_line:  # EOF: the child closed its end of the pipe
                break
            dst.write(current_line)
    except Exception:
        logging.error(traceback.format_exc())
        sys.exit(1)

There is an even easier fix: Let the operating system handle forwarding of stdout & stderr. The subprocess.Popen documentation states:

stdin, stdout and stderr specify the executed program’s standard input, standard output and standard error file handles, respectively. Valid values are None, PIPE, DEVNULL, an existing file descriptor (a positive integer), and an existing file object with a valid file descriptor. With the default settings of None, no redirection will occur.

We could rewrite run_process_and_forward_output with this knowledge to let the operating system handle stream forwarding like this:

def run_process_and_forward_output(cmd):
    process = subprocess.Popen(cmd)
    return process.wait()

This – in essence – is the functionality of subprocess.run(), so the code simplifies to:

def run_process_and_forward_output(cmd):
    return subprocess.run(cmd).returncode

This should solve the issue of exceptions caused by stream forwarding comprehensively.

There are a couple more issues in the start.py code caused by the script running multiple processes:

  • ./letsencrypt.py OR ./certwatcher.py OR nothing
  • dovecot -c /etc/dovecot/proxy.conf
  • /usr/sbin/nginx -g 'daemon off;'

On Linux, processes should be wait(2)-ed upon, otherwise zombie processes may be created. I suppose dovecot will take care of this itself since it daemonizes after startup, but the python scripts won't.
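
For illustration, a minimal sketch of one way a Python runner could reap terminated children so they do not linger as zombies (this is not existing Mailu code):

import os
import signal

def reap_children(signum, frame):
    # Collect the exit status of any terminated child without blocking.
    while True:
        try:
            pid, _status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # no children left
        if pid == 0:
            break  # remaining children are still running

signal.signal(signal.SIGCHLD, reap_children)

Note that reaping from a SIGCHLD handler would race with any subprocess.Popen.wait() calls elsewhere in the script, which is another argument for handing process supervision to a dedicated tool instead.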

The script also allows for the possibility that dovecot crashes while the container and nginx stay alive; this could leave the system in a degraded state.

Personally, I would solve these problems by rewriting start to just set LD_PRELOAD to load the hardened malloc implementation, provision the configuration, create a supervisord config to start all the subprocesses and finally execl into supervisord.

This would preserve the current functionality and outsource all the complex process management to [supervisord](http://supervisord.org/introduction.html).

https://github.com/Mailu/Mailu/blob/aa52bbb731c374a707df41588a108cedaca71fc4/core/nginx/start.py#L15C1-L18C42
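
For concreteness, a hypothetical sketch of what the tail end of such a rewritten start.py could look like (the paths, library name and config location are illustrative assumptions, not the actual Mailu setup):

import os

# Assumption: the hardened malloc is preloaded just as the current start.py does it.
os.environ['LD_PRELOAD'] = '/usr/lib/libhardened_malloc.so'

# ... provision the nginx/dovecot configuration and write a supervisord.conf
#     that runs nginx, dovecot (in the foreground) and the certificate script ...

# Replace the current process with supervisord so it becomes responsible for
# starting, supervising and reaping all subprocesses.
os.execl('/usr/bin/supervisord', 'supervisord', '-c', '/etc/supervisord.conf')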

@koraa koraa changed the title Subprocess code in nginx container is brittle Log forwarding code in nginx container is brittle and may lead to process stalls Nov 1, 2024
@koraa koraa changed the title Log forwarding code in nginx container is brittle and may lead to process stalls [RFC] Log forwarding code in nginx container is brittle and may lead to process stalls Nov 1, 2024
stale bot commented Feb 1, 2025

Issues not for bugs, enhancement requests or discussion go stale after 21 days of inactivity. This issue will be automatically closed after 14 days.
For all metrics refer to the stale.yml file.
Github issues are not meant for user support. For user-support questions, reach out on the matrix support channel.

Mark the issue as fresh by simply adding a comment to the issue.
If this issue is safe to close, please do so now.

@stale stale bot added the status/response_needed Waiting for a response from the author label Feb 1, 2025
@koraa
Author
koraa commented Feb 1, 2025

This issue persists.

@stale stale bot removed the status/response_needed Waiting for a response from the author label Feb 1, 2025
@nextgens nextgens added the type/enhancement Enhances existing functionality label Feb 9, 2025
@nextgens
Contributor
nextgens commented Feb 9, 2025

Sorry for the radio silence, been busy with other things...

#3271 is another example of the same.

I agree that we should improve upon the status quo. I am not very familiar with supervisor (I usually use daemontools), but I would review a PR switching the various containers and processes to supervisor with a dedicated filter process.
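
For concreteness, one hypothetical shape of such a dedicated filter process is a stand-alone wrapper that supervisord would run in place of the real command; the script name and filter rule below are purely illustrative:

#!/usr/bin/env python3
# Hypothetical wrapper: supervisord runs "log_filter_wrapper.py <real command...>".
import subprocess
import sys

def keep(line: bytes) -> bool:
    # Placeholder for the real LogFilter rules.
    return True

def main() -> int:
    proc = subprocess.Popen(sys.argv[1:], stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT)
    for line in proc.stdout:
        if keep(line):
            sys.stdout.buffer.write(line)
            sys.stdout.buffer.flush()
    return proc.wait()

if __name__ == '__main__':
    sys.exit(main())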

@koraa
Author
koraa commented Feb 10, 2025

Good to hear @nextgens; no worries, I know how busy a maintainer's life is.

Happy to hear!

I might be able to implement that.

Is there anyone we should loop into this to make sure everyone is on board on the mailu project side?

Would you mind if I cleaned up the startup script a bit while I am adding supervisord?

@nextgens
Contributor

Is there anyone we should loop into this to make sure everyone is on board on the mailu project side?

You can reach-out on Matrix
https://app.element.io/#/room/#mailu-dev:matrix.org

Would you mind if I cleaned up the startup script a bit while I am adding supervisord?

To be discussed; I am not opposed to it in principle

@lel-amri

One of the biggest potential sources of an exception would be the EINTR system error [...]. There is some code in the above functions handling this case; there is also a lot of other very very complicated error handling and a lot of NULL propagation in the above code and I would not trust it to never raise exceptions.

I believe CPython handles the EINTR case pretty well, but I wouldn't trust it to never raise exceptions either, as anyone should do for any length of Python code running on CPython.

A relatively easy fix would be to add exception handling to forward_text_lines and deliberately crash the entire process on an error [...]
[...]
Personally, I would solve these problems by rewriting start to just set LD_PRELOAD to load the hardened malloc implementation, provision the configuration, create a supervisord config to start all the subprocesses and finally execl into supervisord.

The former solution would be a small plug for a giant barrel potentially full of similar holes. I do like the second approach though: not only does it manage processes better, it could also enable better logging.
