[RFC] Log forwarding code in nginx container is brittle and may lead to process stalls · Issue #3630 · Mailu/Mailu · GitHub

[RFC] Log forwarding code in nginx container is brittle and may lead to process stalls #3630


Open
koraa opened this issue Nov 1, 2024 · 6 comments
Labels
type/enhancement Enhances existing functionality

Comments

@koraa
koraa commented Nov 1, 2024

Hi all,

as per request in #3627, I am moving the discussion here.

Below are the key points from the other thread. Since the post copied below was written, @nextgens kindly pointed out that the stream forwarding functionality is required for LogFilter to work. For this reason, removing the entire stream forwarding functionality is not a suitable fix.

The remaining possibilities are:

  • Adding exception handlers to the log forwarding threads
  • Migrating away from the threaded log forwarder by:
    • Either using asyncio in the runner script (see the sketch after this list),
    • Or using supervisord to manage subprocesses and moving the log filter into a dedicated filter process (e.g. a stand-alone wrapper script).
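
To make the asyncio option concrete, here is a minimal sketch of how an asyncio-based runner could forward (and optionally filter) a child process's output without extra threads. The function name, command and filtering hook are illustrative, not existing Mailu code:

import asyncio
import sys

async def run_and_forward(cmd):
    # Spawn the child with its stdout/stderr as a pipe we can read asynchronously.
    proc = await asyncio.create_subprocess_exec(
        *cmd,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.STDOUT,
    )
    # asyncio.StreamReader is an async iterator over lines (Python 3.8+).
    async for line in proc.stdout:
        # A LogFilter hook could drop or rewrite lines here.
        sys.stdout.buffer.write(line)
        sys.stdout.buffer.flush()
    return await proc.wait()

# e.g. asyncio.run(run_and_forward(['/usr/sbin/nginx', '-g', 'daemon off;']))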

Would a PR adding exception handling to the forwarder thread be desirable?


I gave the underlying Popen code a thorough review. What I found is that it is surprisingly hard to use this API correctly.

I do think forward_text_lines still lacks error handling.

https://github.com/Mailu/Mailu/blob/aa52bbb731c374a707df41588a108cedaca71fc4/core/base/libs/socrate/socrate/system.py#L163C1-L166C32

Using Popen without text=True, as was done in the patch, does help. Popen's init always opens the pipe ends with io.open(<fd>, 'rb'/'wb', <bufsize>) and only wraps those streams in io.TextIOWrapper when text mode is requested. So without the flag, it returns plain binary streams backed directly by the file descriptors.

https://github.com/python/cpython/blob/3275cb19530fb5c7115cf8313f1ada9621ed3a92/Lib/subprocess.py#L1018C13-L1033C62
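
For illustration, a quick way to see the difference (the exact stream types are CPython implementation details, shown here for Python 3):

import subprocess

# Without text=True the pipe is a binary buffered stream...
p = subprocess.Popen(['echo', 'hi'], stdout=subprocess.PIPE)
print(type(p.stdout))   # <class '_io.BufferedReader'>
p.wait()

# ...with text=True it is wrapped in a TextIOWrapper.
p = subprocess.Popen(['echo', 'hi'], stdout=subprocess.PIPE, text=True)
print(type(p.stdout))   # <class '_io.TextIOWrapper'>
p.wait()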

So the functions actually being used are the binary readline/write implementations of those buffered streams.

One of the biggest potential sources of an exception would be the EINTR system error – i.e. the error raised when a unix signal is delivered to a process while a system call is in progress. There is some code in the above functions handling this case; there is also a lot of other very complicated error handling and a lot of NULL propagation in that code, and I would not trust it to never raise exceptions.

A relatively easy fix would be to add exception handling to forward_text_lines and deliberately crash the entire process on an error:

import logging
import sys
import traceback

def forward_text_lines(src, dst):
    try:
        while True:
            current_line = src.readline()
            if not current_line:  # EOF: the child closed its end of the pipe
                break
            dst.write(current_line)
    except Exception:
        logging.error(traceback.format_exc())
        sys.exit(1)

There is an even easier fix: Let the operating system handle forwarding of stdout & stderr. The subprocess.Popen documentation states:

stdin, stdout and stderr specify the executed program’s standard input, standard output and standard error file handles, respectively. Valid values are None, PIPE, DEVNULL, an existing file descriptor (a positive integer), and an existing file object with a valid file descriptor. With the default settings of None, no redirection will occur.

We could rewrite run_process_and_forward_output with this knowledge to let the operating system handle stream forwarding like this:

def run_process_and_forward_output(cmd):
    process = subprocess.Popen(cmd)
    return process.wait()

This – in essence – is the functionality of subprocess.run(), so the code simplifies to:

def run_process_and_forward_output(cmd):
    return subprocess.run(cmd).returncode

This should solve the issue of exceptions caused by stream forwarding comprehensively.

There are a couple more issues in the start.py code caused by the script running multiple processes:

  • ./letsencrypt.py OR ./certwatcher.py OR nothing
  • dovecot -c /etc/dovecot/proxy.conf
  • /usr/sbin/nginx -g 'daemon off;'

On Linux, processes should be wait(2)-ed upon, otherwise zombie processes may be created. I suppose dovecot will take care of this itself since it daemonizes after startup, but the python scripts won't.
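
For illustration, a minimal sketch of one way a Python runner could reap terminated children so they do not linger as zombies (this is not existing Mailu code):

import os
import signal

def reap_children(signum, frame):
    # Collect the exit status of any terminated child without blocking.
    while True:
        try:
            pid, _status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # no children left
        if pid == 0:
            break  # remaining children are still running

signal.signal(signal.SIGCHLD, reap_children)

Note that reaping from a SIGCHLD handler would race with any subprocess.Popen.wait() calls elsewhere in the script, which is another argument for handing process supervision to a dedicated tool instead.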

The script also allows for the possibility that dovecot crashes while the container and nginx stay alive; this could leave the system in a degraded state.

Personally, I would solve these problems by rewriting start to just set LD_PRELOAD to load the hardened malloc implementation, provision the configuration, create a supervisord config to start all the subprocesses and finally execl into supervisord.

This would preserve the current functionality and outsource all the complex process management to [supervisord](http://supervisord.org/introduction.html).

https://github.com/Mailu/Mailu/blob/aa52bbb731c374a707df41588a108cedaca71fc4/core/nginx/start.py#L15C1-L18C42
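
For concreteness, a hypothetical sketch of what the tail end of such a rewritten start.py could look like (the paths, library name and config location are illustrative assumptions, not the actual Mailu setup):

import os

# Assumption: the hardened malloc is preloaded just as the current start.py does it.
os.environ['LD_PRELOAD'] = '/usr/lib/libhardened_malloc.so'

# ... provision the nginx/dovecot configuration and write a supervisord.conf
#     that runs nginx, dovecot (in the foreground) and the certificate script ...

# Replace the current process with supervisord so it becomes responsible for
# starting, supervising and reaping all subprocesses.
os.execl('/usr/bin/supervisord', 'supervisord', '-c', '/etc/supervisord.conf')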

@koraa koraa changed the title Subprocess code in nginx container is brittle Log forwarding code in nginx container is brittle and may lead to process stalls Nov 1, 2024
@koraa koraa changed the title Log forwarding code in nginx container is brittle and may lead to process stalls [RFC] Log forwarding code in nginx container is brittle and may lead to process stalls Nov 1, 2024
stale bot commented Feb 1, 2025

Issues not for bugs, enhancement requests or discussion go stale after 21 days of inactivity. This issue will be automatically closed after 14 days.
For all metrics refer to the stale.yml file.
Github issues are not meant for user support. For user-support questions, reach out on the matrix support channel.

Mark the issue as fresh by simply adding a comment to the issue.
If this issue is safe to close, please do so now.

@stale stale bot added the status/response_needed Waiting for a response from the author label Feb 1, 2025
@koraa
Author
koraa commented Feb 1, 2025

This issue persists.

@stale stale bot removed the status/response_needed Waiting for a response from the author label Feb 1, 2025
@nextgens nextgens added the type/enhancement Enhances existing functionality label Feb 9, 2025
@nextgens
Contributor
nextgens commented Feb 9, 2025

Sorry for the radio silence, been busy with other things...

#3271 is another example of the same.

I agree that we should improve upon the status quo. I am not very familiar with supervisor (I usually use daemontools), but I would review a PR switching the various containers and processes to supervisor with a dedicated filter process.
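
For concreteness, one hypothetical shape of such a dedicated filter process is a stand-alone wrapper that supervisord would run in place of the real command; the script name and filter rule below are purely illustrative:

#!/usr/bin/env python3
# Hypothetical wrapper: supervisord runs "log_filter_wrapper.py <real command...>".
import subprocess
import sys

def keep(line: bytes) -> bool:
    # Placeholder for the real LogFilter rules.
    return True

def main() -> int:
    proc = subprocess.Popen(sys.argv[1:], stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT)
    for line in proc.stdout:
        if keep(line):
            sys.stdout.buffer.write(line)
            sys.stdout.buffer.flush()
    return proc.wait()

if __name__ == '__main__':
    sys.exit(main())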

@koraa
Author
koraa commented Feb 10, 2025

Good to hear @nextgens; no worries, I know how busy a maintainer's life is.

Happy to hear!

I might be able to implement that.

Is there anyone we should loop into this to make sure everyone is on board on the mailu project side?

Would you mind if I cleaned up the startup script a bit while I am adding supervisord?

@nextgens
Contributor

Is there anyone we should loop into this to make sure everyone is on board on the mailu project side?

You can reach-out on Matrix
https://app.element.io/#/room/#mailu-dev:matrix.org

Would you mind if I cleaned up the startup script a bit while I am adding supervisord?

To be discussed; I am not opposed to it in principle

@lel-amri

One of the biggest potential sources of an exception would be the EINTR system error [...]. There is some code in the above functions handling this case; there is also a lot of other very very complicated error handling and a lot of NULL propagation in the above code and I would not trust it to never raise exceptions.

I believe CPython handles the EINTR case pretty well, but I wouldn't trust it to never raise exceptions either, as anyone should do for any length of Python code running on CPython.

A relatively easy fix would be to add exception handling to forward_text_lines and deliberately crash the entire process on an error [...]
[...]
Personally, I would solve these problems by rewriting start to just set LD_PRELOAD to load the hardened malloc implementation, provision the configuration, create a supervisord config to start all the subprocesses and finally execl into supervisord.

The former solution would be a small plug for a giant barrel potentially full of similar holes. I do like the second approach though: not only does it manage processes better, it could also enable better logging.
