Disk caching by srett · Pull Request #683 · epoupon/lms · GitHub

Disk caching #683


Closed
wants to merge 7 commits into from

Conversation

@srett (Contributor) commented May 29, 2025

My attempt at implementing disk caching.
Introduces a new config variable, transcode-cache-size. Set it to a value other than 0 to enable caching.
It is meant to become the maximum size of the cache later on, but there is no cleanup worker yet. In fact there are still quite a few TODOs in the code, and a bit too much debug spam.
But I thought I'd rather share this sooner than later for feedback, so I tried adding comments in places I thought might benefit from an explanation of the reasoning behind them. The code is probably also quite ugly in places, due to the focus on getting things to work and my lack of knowledge regarding modern C++ and boost. :)
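For reference, enabling the cache in the lms configuration file might look like this. This is a hypothetical lms.conf fragment: the key name comes from this PR, but the units and the default shown here are assumptions, not confirmed by the code.

```
# Maximum size of the transcode disk cache.
# 0 (assumed to be the default) disables caching entirely.
# Units are assumed to be bytes here; check the final implementation.
transcode-cache-size = 1073741824;  # 1 GiB
```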

(Please ignore the first 2 commits entirely, a proper rebase will be done once this is in acceptable shape.)

srett added 7 commits May 23, 2025 16:44
Keeping the file open allows better prefetching/caching on both
the application layer (ifstream) and OS level (vfs layer) as
the access pattern is sequential and predictable.
While common C++ documentation repeatedly states that
seekg/tellg set/get the read position in the stream, and
seekp/tellp do the same for the write position, I had to ask
ChatGPT why on earth the output files kept getting corrupted,
only for it to tell me that a std::fstream is backed by a single
std::basic_filebuf, which maintains one joint file position for
both reading and writing (on POSIX platforms this simply wraps
the one shared offset of the underlying file descriptor).
Good job not mentioning this anywhere prominent, like, idk,
maybe cppreference.com or such. >:(
- Don't set FD_CLOEXEC, but close manually before fork:
  Since we don't use pipe2 to set the flag atomically, we're prone to
  races with parallel fork+execs either way. As ChildProcess uses a
  mutex here anyway, as long as no other part of lms forks+execs,
  we're no more or less safe than before, but the code is shorter
  and the fcntl cannot fail.
- Do not close stdin and stderr. Again, ffmpeg seems to handle this fine,
  but if a process tries to write to stderr and fd 2 is not open, it
  might just bail out. Even worse, the process might have opened
  some file it wants to work with, and that file got assigned fd 2
  because that was the next free fd - the program would then corrupt
  whatever file it opened there whenever it tried to write to stderr.
  Try to open /dev/null for stdin and stderr instead, and if that
  fails, keep whatever lms inherited.
- exec does not return on success, so there is no need to check the
  return code; any return is a failure
- Fix a wrong error message in the error path when assigning the fd
  to the boost stream
@epoupon (Owner) commented May 30, 2025

Hello!
This is not a trivial subject so thanks for working on this :)

You are right, sharing now is better, so that we can discuss general design.
As a general rule, the bigger the PR, the less chance it has of being integrated :(
So first I would say that changes like in the last commit (on the ChildProcessManager) should be set in a separate PR.

Can you also rebase on develop?

@epoupon (Owner) commented May 30, 2025

The modifications on FileResource seem to be unrelated?
Code has been copied from wt's implementation. I think ifs is not stored because a server may serve thousands of concurrent requests in parallel, and you can't have an unlimited number of file descriptors open. If it is no big deal, I would just not change that.

I think we need a transcoding service that would be used from the subsonic code or from the ui code.
If caching is enabled, it would be this service's job to handle it; it should not be in the scope of the "av" library, which should stay as low level as possible. (BTW, this "av" library is expected to be heavily (if not entirely) rewritten, since the ultimate plan is to drop ffmpeg at some point and use its underlying libraries directly.)

So a first step would be to move existing code to a new service, that does not even do caching, and then add the caching feature.
I would shift the resource handlers from av to this new service and just expose a "ITranscoder" interface from av.

@srett (Contributor, author) commented May 30, 2025

You are right, sharing now is better, so that we can discuss general design. As a general rule, the bigger the PR the less chance it has to be integrated :(

Thanks for getting back to me this quickly! Yeah, I could see that coming. Worst case, at least the concept survives. ;)

So first I would say that changes like in the last commit (on the ChildProcessManager) should be set in a separate PR.

Agreed, I'll make several smaller PRs.

Regarding FileResource:
The reasoning behind it was partly the overhead of repeatedly opening and closing the same file over and over, resulting in 3 syscalls (open, lseek, close) per 256 KiB chunk. Also, as stated in the commit message, it might confuse the OS's vfs layer, and if ifstream prefetches data, we'd essentially throw that away every time and re-read it on the next iteration. I guess no lms instance is so big that this would ever really be the bottleneck, but the same probably goes for file descriptor exhaustion. :)
But the particular reason for the change with regard to disk caching was avoiding a race condition once a cleanup worker is implemented. While I agree it is mostly theoretical and unlikely in practice, as things stand a file could be deleted while it is being served, and the transfer would get interrupted on the next call to processRequest(), where we try to open the file again. If you open the file just once, it can safely be deleted while it is still held open by the FileResourceHandler. That's what I added the sourceGood() method for: it is checked after creating the instance, and if it returns false I assume the file vanished in the meantime and just transcode it again.

The changes in ChildProcess in particular were preparations for the TODO I left:
https://github.com/srett/lms/blob/0a4ceb2c21c1fde3d2b271d8beae8e170958d4df/src/libs/av/impl/CachingTranscoderSession.cpp#L41-L53

Which should probably also be another PR after the basic version looks good. :)

Can you also rebase on develop?

Sure, I only realized this after opening the PR.

So a first step would be to move existing code to a new service, that does not even do caching, and then add the caching feature.
I would shift the resource handlers from av to this new service and just expose a "ITranscoder" interface from av.

So I guess this should be the first step now, before doing any more work. Should I just try to figure out what that should look like, or would you prefer to do the refactoring of the existing code yourself, and then I add the rest?

(BTW, this "av" library is expected to be heavily (if not entirely) rewritten since the ultimate plan is to drop ffmpeg at some point and directly use its underlying libraries)

Unrelated, but just my 2ct: I actually prefer the current design. It allows users to quickly swap out the binary for an older/newer version for testing, and avoids dealing with a large API surface and resource handling. Also, if for example a corrupted file triggers a crash bug, only the ffmpeg process will die, not the entire lms instance.

@epoupon (Owner) commented May 30, 2025

Regarding FileResource: The reasoning behind it was partly the overhead of repeatedly opening and closing the same file over and over, resulting in 3 syscalls (open, lseek, close) per 256 KiB chunk. Also, as stated in the commit message, it might confuse the OS's vfs layer, and if ifstream prefetches data, we'd essentially throw that away every time and re-read it on the next iteration. I guess no lms instance is so big that this would ever really be the bottleneck, but the same probably goes for file descriptor exhaustion. :) But the particular reason for the change with regard to disk caching was avoiding a race condition once a cleanup worker is implemented. While I agree it is mostly theoretical and unlikely in practice, as things stand a file could be deleted while it is being served, and the transfer would get interrupted on the next call to processRequest(), where we try to open the file again. If you open the file just once, it can safely be deleted while it is still held open by the FileResourceHandler. That's what I added the sourceGood() method for: it is checked after creating the instance, and if it returns false I assume the file vanished in the meantime and just transcode it again.

Fair enough! Not sure about sourceGood() though; I would have just thrown/caught an exception in case of a file access error.

So this should be the first step now I guess before doing any more work. Should I just try to figure out how that should look like, or would you prefer doing the refactoring of the existing code and then I add the rest?

Please don't waste time on this, I'd prefer you to focus on more useful parts of the code :)
So I will do the first step myself asap.

Unrelated but just my 2ct, I actually prefer the current design. It allows users to quickly swap out the binary for an older/newer version for testing, and avoids dealing with a large API surface, resource handling, and in case there is for example a corrupted file triggering a crash bug, only the ffmpeg process will die, and not the entire lms instance.

Yes, I admit there are indeed some drawbacks. Last time I tried (many years ago), the ffmpeg code path for transcoding had some annoying hacks, but maybe that's better now. That's the advantage of isolating this in a low-level library: it should be a pure implementation detail with no impact on the rest of the code.
That being said, lms's av already doesn't rely on ffprobe to inspect files, so using another ffmpeg instance could introduce some discrepancies. (In order to implement the newer OpenSubsonic transcoding API, av will be used to probe and report codecs etc.)

@epoupon (Owner) commented May 31, 2025

Ok, I put what I had in mind on develop. It is largely untested, but I think it should be good enough to start from.

@srett (Contributor, author) commented Jun 2, 2025

New version soon...

@srett closed this Jun 2, 2025