vine: possible hash table inefficiencies #4115
Comments
+1 for writing a benchmark that evaluates the performance of the hash table independently of TaskVine. Perhaps @Ian2327 would like to take a crack at that? |
Sure! I'll take that challenge. Should I write the test in C or some other language? |
I think C is a good language in which to do the test. Basically, we need a benchmark that measures the time to iterate the hash table for 100K, 200K ... 900K items. Then start removing them and measure the iteration time. Does it go back down? |
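For reference, here is a minimal sketch of that kind of benchmark, assuming the cctools hash_table API declared in dttools/src/hash_table.h (hash_table_create, hash_table_insert, hash_table_remove, hash_table_firstkey/hash_table_nextkey, hash_table_size, hash_table_delete); the timing helper and key format are just illustrative choices, not the actual test written later in this thread.

```c
/* Sketch of a standalone benchmark for the cctools hash table:
 * insert items in steps of 100K up to 900K, timing a full iteration
 * at each step, then remove them in steps and time the iteration
 * again to see whether it returns to its original cost. */

#define _POSIX_C_SOURCE 200809L

#include "hash_table.h"

#include <stdio.h>
#include <time.h>

static double wall_time(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

static double time_iteration(struct hash_table *h)
{
    char *key;
    void *value;
    double start = wall_time();
    hash_table_firstkey(h);
    while (hash_table_nextkey(h, &key, &value)) {
        /* nothing to do: we only care about traversal cost */
    }
    return wall_time() - start;
}

int main(void)
{
    struct hash_table *h = hash_table_create(0, 0);
    char key[64];

    for (int n = 100000; n <= 900000; n += 100000) {
        /* grow the table to n items */
        for (int i = hash_table_size(h); i < n; i++) {
            sprintf(key, "key-%d", i);
            hash_table_insert(h, key, (void *) 1);
        }
        printf("items %7d  iterate %.6f s\n", n, time_iteration(h));
    }

    for (int n = 800000; n >= 0; n -= 100000) {
        /* remove items again: does iteration time recover? */
        for (int i = hash_table_size(h) - 1; i >= n; i--) {
            sprintf(key, "key-%d", i);
            hash_table_remove(h, key);
        }
        printf("items %7d  iterate %.6f s\n", n, time_iteration(h));
    }

    hash_table_delete(h);
    return 0;
}
```

Compiling this together with dttools' hash_table.c (or linking against the dttools library) should be enough to run it.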
I should be testing the |
That's correct. |
These are the results I got:
It doesn't seem like the iteration time after removing the elements returns to the range it was in before the elements were added. |
After reading through the |
Correct, we only implemented the doubling because at the time we didn't see a reason to shrink. Also, I'm not convinced that this is really an issue, or at least that it is causing the observed slowdown. The reason is that the manager still spends some time in link_poll, which means that it is working under capacity, so this particular bottleneck is most likely coming from somewhere else. |
While the total impact on taskvine is unclear, I think this is a clear deficiency in the data structure, and it will be a good exercise for @Ian2327 to fix it. Ian, please go ahead and work up a PR, and then let's discuss. Yes, you will have to implement the halving function, and @btovar is correct that some hysteresis will be needed. |
Do we know which table is causing this? It may be that the tables are not really empty, but taskvine needs to keep tabs on all that info (e.g., files that have not been undeclared), or there is a memory leak where we are not deleting things from the tables when we should. Which type of workflow is this? Are there dependencies among the tasks? We have run workflows bigger than that before, and the hash tables were not such a bottleneck. I'm a little hesitant to add more dynamic complexity to the code without being sure that this is really the bottleneck we think it is. |
@btovar I am 60% sure it was the |
But at most you have O(1000) workers joined, right? That should be ok for these tables? |
Yes, I agree with this, it could be that the manager got stuck with something else and gave this code a high priority to run. |
In this case, I think we could change |
In fact |
But it calls |
Yes, you are correct, good catch. |
@Ian2327 we will continue to argue about what contributes to TaskVine performance. :) In the meantime, I would like you to write and test a modification to the hash table that shrinks the table by 1/2 whenever the load falls below 1/8. Then let's see what that performance looks like. |
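Purely as a sketch of what that could look like in a chained hash table: the struct layout and names below (toy_hash_table, toy_resize, MIN_BUCKETS) are invented for illustration and are not the actual cctools internals. The policy is the interesting part: after a removal, if the load (size / bucket_count) has fallen below 1/8, rebuild the table with half as many buckets.

```c
#include <stdlib.h>
#include <string.h>

#define MIN_BUCKETS 127  /* floor: never shrink below the initial size */

struct entry {
    char *key;
    void *value;
    struct entry *next;
};

struct toy_hash_table {
    struct entry **buckets;
    int bucket_count;
    int size;
    unsigned (*hash)(const char *key);
};

/* Rebuild the table with a new bucket count, rehashing every entry. */
static void toy_resize(struct toy_hash_table *h, int new_count)
{
    struct entry **new_buckets = calloc(new_count, sizeof(*new_buckets));
    if (!new_buckets) return; /* keep the old table on allocation failure */

    for (int i = 0; i < h->bucket_count; i++) {
        struct entry *e = h->buckets[i];
        while (e) {
            struct entry *next = e->next;
            unsigned b = h->hash(e->key) % new_count;
            e->next = new_buckets[b];
            new_buckets[b] = e;
            e = next;
        }
    }
    free(h->buckets);
    h->buckets = new_buckets;
    h->bucket_count = new_count;
}

/* Remove a key; afterwards, shrink by half if the load drops below 1/8. */
void *toy_remove(struct toy_hash_table *h, const char *key)
{
    unsigned b = h->hash(key) % h->bucket_count;
    struct entry **link = &h->buckets[b];

    for (struct entry *e = *link; e; link = &e->next, e = e->next) {
        if (strcmp(e->key, key) == 0) {
            void *value = e->value;
            *link = e->next;
            free(e->key); /* assumes keys were strdup'd at insert time */
            free(e);
            h->size--;

            /* shrink by half when the load falls below 1/8 */
            if (h->bucket_count > MIN_BUCKETS &&
                h->size * 8 < h->bucket_count) {
                toy_resize(h, h->bucket_count / 2);
            }
            return value;
        }
    }
    return 0;
}
```

Because a sensible grow threshold sits well above 1/8 (halving at 1/8 leaves the load at 1/4), the gap between the two thresholds provides the hysteresis mentioned earlier, so a size oscillating around one boundary does not make the table repeatedly grow and shrink.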
This is what I got after implementing the bucket halving:
This was the run before the implementation:
|
Hmm, maybe increase the number of items by a factor of 100 to get more consistent runtimes? |
This is what I got:
(without halving):
|
I guess you won't see a significant performance difference since you're only testing insertion and removal, both of which are amortized constant-time operations. Also, the halving threshold doesn't seem quite right: in the removal phase, when 10,000,000 items remain, the number of buckets has only dropped to 133,169,152, so the load is 0.075, still below the 1/8 threshold, which should have triggered further halving. And when all items are removed, the bucket count is not reduced at all. |
There was a slight issue with the order in which I was printing things.
(Without halving):
|
I get something like this if I end up calling the
The number of buckets stops at 33,292,288 since it halves at most once before each iteration.
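If the goal is for the bucket count to track the load all the way down, one option (again using the invented names from the sketch above, not the real cctools code) is to halve in a loop until the load rises back above the threshold or the minimum bucket count is reached:

```c
/* Illustrative only: shrink repeatedly rather than at most once, so a
 * table that has become badly oversized is brought back in line as soon
 * as the check runs.  toy_resize and MIN_BUCKETS are the invented
 * helpers from the earlier sketch. */
static void toy_maybe_shrink(struct toy_hash_table *h)
{
    while (h->bucket_count > MIN_BUCKETS && h->size * 8 < h->bucket_count) {
        toy_resize(h, h->bucket_count / 2);
    }
}
```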
If I incorporate the halving into the
|
Result of gprof:
with this output:
Insertion Phase:
Removal Phase:
|
I observed a significant slowdown toward the end of a run with 880K tasks in total. With approximately 90% of tasks completed, 1000 available cores, fewer than 100 tasks running concurrently, and thousands of tasks remaining in the ready queue, the system became sluggish.
The progress bar advanced very slowly: only one or two tasks completed every 10 seconds, compared with the initial stage, when hundreds of tasks completed every second.
To investigate, I used perf to monitor which function the manager spent most of its time on. The commands were:
And here is the output:
It looks like hash table iteration became extremely expensive once the table had held a large number of elements over its lifetime.
My hypothesis was that the table capacity is expanded every time the growth threshold is triggered, and elements come and go over time, but the table is never shrunk. As a result, after holding many elements, the number of valid elements in the hash table is small, but the table itself has become very large due to previous expansions, making the final iterations very time-consuming.
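That hypothesis lines up with how iteration over a chained table typically works: the iterator must scan every bucket, empty or not, so a full pass costs time proportional to the number of buckets plus the number of elements. A sketch using the same invented names as the earlier examples (not the actual cctools iterator):

```c
/* Illustrative iterator over the toy table from the sketches above: it
 * must scan every bucket, including empty ones, so a full pass costs
 * time proportional to the bucket count even when few elements remain. */
void toy_iterate(struct toy_hash_table *h,
                 void (*visit)(const char *key, void *value))
{
    for (int i = 0; i < h->bucket_count; i++) {
        for (struct entry *e = h->buckets[i]; e; e = e->next) {
            visit(e->key, e->value);
        }
    }
}
```

With only a small fraction of elements left in a table sized for its peak, nearly all of the iteration time goes to skipping empty buckets.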
I implemented a shrinking strategy to clean up redundant capacity when the number of elements is low. Combined with some other changes, I saw improved performance, but I am not sure whether the shrinking itself is what had the positive impact.
I need to write a separate program to test whether the hash table becomes sluggish under large-scale scenarios.