Preventing out of control DB growth #98
Comments
The situation is a bit more complicated. I will try to explain.
So, what can we do? The root issue is not the size of the serialized commit in the database, it's that we create so many of them during normal operations.
All in all I would argue this can wait until 0.7.0, since it's "just" an inefficiency. Long-term we need a solution. Hope that brings some clarity.
Thanks. Now I understand why git has […]. Let me repeat back the main point (please correct me if I am wrong):
If I understood it correctly, we follow the […]. But this may be the wrong way; I agree this is a fight for v0.7.0. But let's have a draft of a solution.
Downside of this approach: a slower startup (which could be made fast with checkpoints).
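A minimal sketch of how checkpoints could keep startup fast under such a delta-based design; the package, type, and function names below are hypothetical and not brig's actual API:

```go
// Hypothetical sketch (not brig's actual code): rebuild the working tree
// from the newest checkpoint plus only the deltas committed after it,
// so startup cost depends on recent changes, not on total history length.
package history

// Delta records what a single commit changed: path -> content hash.
type Delta struct {
	Upserts map[string]string // added or modified paths
	Removed []string
}

// Checkpoint is a periodic full snapshot of the tree.
type Checkpoint struct {
	Seq  int               // index of the last delta folded into this snapshot
	Tree map[string]string // full path -> content hash mapping at that point
}

// RebuildTree replays only the deltas newer than the checkpoint.
func RebuildTree(cp Checkpoint, deltas []Delta) map[string]string {
	tree := make(map[string]string, len(cp.Tree))
	for path, hash := range cp.Tree {
		tree[path] = hash
	}
	for _, d := range deltas[cp.Seq+1:] {
		for _, path := range d.Removed {
			delete(tree, path)
		}
		for path, hash := range d.Upserts {
			tree[path] = hash
		}
	}
	return tree
}
```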
Both points are correct. I think this drawing illustrates it well:
Also correct. Although it's not the […].
The reason git is slow for big files is that it tries to compute diffs, even for large binary files. The solution there is to use Git LFS, which does something very similar to brig: just store pointers to the file content and version those. My point here is that the slowness is not coming from the use of an MDAG, but rather from storing the files themselves in it and diffing them.
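For illustration, a rough sketch of that pointer idea in Go (the names are made up for the example, not taken from brig's code):

```go
// Hypothetical sketch of "store pointers, not content": the versioned
// metadata only carries a content hash and size, so comparing two versions
// of a file is a hash comparison instead of a byte-level diff.
package metadata

// FileNode is what the commit graph would version; the actual bytes
// live in the content backend (IPFS in brig's case) under ContentHash.
type FileNode struct {
	Path        string
	Size        int64
	ContentHash string // e.g. an IPFS hash, not the file contents
}

// Changed reports whether two versions of the same path differ,
// without ever reading the file contents.
func Changed(a, b FileNode) bool {
	return a.ContentHash != b.ContentHash || a.Size != b.Size
}
```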
That's roughly what projects like Pijul or Darcs are doing. Both are interesting, but neither found widespread adoption. Darcs especially hit issues with real-world usage, and I still consider Pijul more of a research project. Very interesting, but definitely not as well understood as an MDAG. We might simply run into other problems with this approach. Also remember that switching to a diff-based storage approach would essentially mean rewriting large parts of brig. That's a bit much to solve the original problem stated above. 😄
Aside from the garbage collection issues discussed in #92, we have a bigger problem.
A repo with about 1000 files (with short names) has a commit value size of about 340 kB.
Doing a `touch` adds a new commit value of about 340 kB+ to the DB.
So our commits are large and keep growing with repo size, leading to DB growth that is faster than the growth of the stored files themselves.
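A back-of-envelope estimate of that growth, assuming each commit serializes the full tree at roughly 340 bytes of metadata per file (extrapolated from the 340 kB / 1000 files figure above):

```go
// Rough, assumption-laden estimate: if every commit serializes the full tree
// at ~340 bytes of metadata per file (matching the ~340 kB for 1000 files
// mentioned above), commit data in the DB grows with files * operations.
package estimate

const bytesPerFileEntry = 340 // assumed average per-file share of a commit value

// CommitDataBytes estimates the commit metadata accumulated in the DB
// (before GC) after `ops` commit-creating operations on a repo with `files` files.
func CommitDataBytes(files, ops int) int {
	perCommit := files * bytesPerFileEntry // ~340 kB for 1000 files
	return perCommit * ops                 // e.g. 1000 files, 100 touches -> ~34 MB
}
```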
There are also additional GC issues:
- We add obsolete values of large size to the amount waiting to be GCollected, since every `touch` by itself creates a HEAD-type commit (of the ~340 kB size above). It is erased by the next `touch` but stays in the DB until garbage collected.
- `commit` also re-puts a lot of `tree./filename`-type keys in the DB, as well as `objects.2W9rNbTusHTTf...` keys seemingly related to a filename. So as the number of files grows, the number of keys re-put into the DB grows with it, and we keep stuffing the DB with values waiting to be GCollected.

All of this leads to exponential growth of the DB with every file addition, even if we do GC.
We probably need to think about a better commit format that reflects only the difference between states.
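A minimal sketch of what such a difference-only commit record could look like; the field names and layout here are purely illustrative, not brig's actual on-disk format:

```go
// Hypothetical difference-only commit record: a commit lists only the paths
// that changed relative to its parent, so its size tracks the change set,
// not the total number of files in the repo.
package commitfmt

// Change describes one modified path between parent and child commit.
type Change struct {
	Path    string
	OldHash string // empty for additions
	NewHash string // empty for removals
}

// DeltaCommit is the proposed commit value stored in the DB.
type DeltaCommit struct {
	Parent  string // hash of the parent commit
	Author  string
	Message string
	Changes []Change
}
```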