Improve pymepix pipeline performance (centroiding) · Issue #14 · CFEL-CMI/pymepix · GitHub
This repository was archived by the owner on Mar 25, 2024. It is now read-only.

Improve pymepix pipeline performance (centroiding) #14

Closed
BenMoon opened this issue Jul 25, 2021 · 1 comment
Assignees
Labels
enhancement New feature or request

Comments

@BenMoon
Contributor
BenMoon commented Jul 25, 2021

As expected, the centroiding in pymepix becomes a bottleneck when run with too much data. We measured the time the clustering and the centroiding take separately. Clustering takes far longer (roughly a factor of 100: 0.2–0.75 s for clustering versus 0.002–0.006 s for centroiding, for approximately 15,000 voxels per trigger, with 5898 triggers in the complete dataset). The pipeline is not fast enough to empty the queue of data coming from the packetprocessor, so data piles up. With a single centroiding process and the dataset described above at 100 Hz, a queue of about 72 packets builds up (approximately one packet per second); processing would have to be twice as fast in this case to handle all packets in time.

For now, the number of data chunks being sent to the packetprocessor is reduced to a third. The number of ions per shot is around 1, and we run at 200 Hz.

I can think of a couple of options to improve the performance of the centroiding:

  1. It should be possible to start multiple instances of the centroiding
  2. Providing n_jobs to dbscan doesn't work at the moment; we submitted an issue with joblib (Loky-backed parallel loops cannot be called in a multiprocessing, setting n_jobs=1 warning with scikit-learn joblib/joblib#1204)
  3. Parallelise dbscan in another way, without n_jobs
  4. Replace scikit-learn.dbscan with a compiled version
  5. Combine 3 and 4
  6. Add the centroiding to the clustering in a compiled version, like 5

Option 1 should be very straightforward and is worth testing to see whether it solves the immediate bottleneck for now. In pymepix/pymepix/processing/acquisition.py, line 141, change

self.addStage(4, Centroiding)

to

self.addStage(4, Centroiding, num_processes=10)

As for option 2, let's see what they come up with.

For option 3 I see a lot of potential, if done right. In principle, the structure of the data allows a trivial parallelisation of the cluster finding. The danger here is that if dbscan is performed over each trigger number like so:

import numpy as np
from sklearn.cluster import DBSCAN

unique_trig_nr = np.unique(trigger_numbers)
for trig_nr in unique_trig_nr:
    idx = np.where(trigger_numbers == trig_nr)
    # stack the slices into a (n_samples, 3) array, as DBSCAN.fit expects
    data = np.column_stack((x[idx], y[idx], tof[idx]))
    DBSCAN(...).fit(data)  # .fig() was a typo for .fit()
...

the time dbscan itself takes, compared with the time spent slicing the data, calling the function, serialising the data, and so on, is much too short, and we would produce a tremendous overhead. The cluster computation has to take much longer than all the overhead the parallelisation introduces.
The way we could achieve this is to leave the clustering as it is, but add to centroiding.py a worker pool which performs the clustering on a per-chunk basis and maybe even does the centroiding asynchronously. Naively this should scale linearly, so with a pool of 10 workers we can expect an improvement of a factor of 10.
Need to look into joblib and loky some more to understand the possibilities of creating such a worker pool...
→ After a bit of reading, multiprocessing.Pool is probably the best bet here; a minimal sketch follows below.
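To make the idea concrete, here is a minimal sketch of such a chunk-based pool. The helper names (cluster_chunk, cluster_all), the chunk layout (a tuple of trigger_numbers, x, y, tof arrays), and the DBSCAN parameters are made up for illustration and are not the current pymepix API:

import numpy as np
from multiprocessing import Pool
from sklearn.cluster import DBSCAN

NUM_WORKERS = 10  # assumed pool size, matching the naive 10x estimate above

def cluster_chunk(chunk):
    # One task = one chunk containing many triggers, so the slicing and
    # serialisation overhead is amortised over many dbscan runs.
    trigger_numbers, x, y, tof = chunk
    labels = np.full(trigger_numbers.shape, -1, dtype=np.int64)
    for trig_nr in np.unique(trigger_numbers):
        idx = np.where(trigger_numbers == trig_nr)[0]
        data = np.column_stack((x[idx], y[idx], tof[idx]))
        # eps/min_samples are placeholder values, not tuned settings;
        # labels are per-trigger cluster ids (-1 marks noise)
        labels[idx] = DBSCAN(eps=2.0, min_samples=3).fit(data).labels_
    return labels

def cluster_all(chunks):
    # One pool for the whole run; results come back in submission order
    with Pool(NUM_WORKERS) as pool:
        return pool.map(cluster_chunk, chunks)

The crucial design point is that the unit of work handed to the pool is a whole chunk of triggers, not a single trigger, so the per-task overhead stays small relative to the actual clustering time.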

scikit-learn.dbscan looks like a pure Python implementation (https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/cluster/_dbscan.py). However, apparently it gets compiled with Cython, though just by looking at dbscan I don't quite understand how (a quick way to check is shown below). If it doesn't get compiled, we can expect a significant acceleration from a compiled replacement (again, provided the overhead of context switching and potential memory copies isn't too big...); if it does get compiled, we won't. E.g. https://github.com/s1998/parallel-DBSCAN claims a speed-up of 13x on a 36-core machine compared to scikit-learn.dbscan (frankly, if a C++ implementation running on 36 cores only gets 13 times faster than scikit-learn.dbscan, dbscan is already pretty efficient). Btw, numba probably doesn't help on that end either, although testing it isn't much effort, and then we would know for sure...
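For what it's worth, a quick way to check where the compiled part lives, assuming a recent scikit-learn in which _dbscan.py delegates its inner loop to the Cython extension sklearn.cluster._dbscan_inner (if the layout differs in the installed version, the import will simply fail):

from sklearn.cluster import _dbscan_inner

# A .so (Linux/macOS) or .pyd (Windows) suffix in this path means the
# inner region-expansion loop is a compiled Cython extension, not pure Python.
print(_dbscan_inner.__file__)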

The lowest-hanging fruit is option 1, and the biggest difference should come from option 3. If still necessary, switching to a compiled version of dbscan should be a transparent change, but since we would already be running the clustering in a pool of workers by then, not much additional performance gain is expected (maybe a factor of 5 or 10 at best).

@BenMoon BenMoon added the enhancement New feature or request label Jul 25, 2021
@troehling
Contributor
troehling commented Jul 26, 2021
  • fill ??? with values (under which conditions, statistics from the dataset)
    • tested dataset:
      • approximately 15,000 voxels per trigger, 5898 triggers in the dataset overall
      • packages sent at a rate of 100 Hz
    • With these data rates, the queue size grows to about 72 packages after 60 s of data is received
    • roughly one package per second cannot be processed in time and accumulates in the queue
  • show an indicator in the GUI if the queue gets too large
    • qsize does not work under macOS; use a try-catch or query the OS (a sketch follows after this list)
    • The current size of the input queue of the centroiding is now displayed in the processing tab
    • If the size exceeds the current limit of 30 packages in the queue, a warning is displayed to the user
    • The warning states the following: "The queue size has passed the warning size of 30 packages with 33 packages in the queue. The data can not be processed at the rate of arrival. Please consider increasing the number of processes allocated to the centroiding, reducing the data frequency or increase the packages skipped for live processing. The current queue size can be checked in the processing tab. This can have a significant impact on the recorded data! Possible loss of data!"
  • increase number of processes statically
    • The number of processes for centroiding is increased to 25 for now
    • this should be enough at the moment
    • For higher data rates it might need to be increased even further
    • @sebastian-trippel suggested using as many cores as possible for this. @BenMoon assumes this might incur a higher overhead due to blocking when reading from the queue.
  • increase number of processes dynamically
    • If a queue builds up, increase the number of processes dynamically (scale up)
    • the open question is how to scale down in an intelligent way
  • Revert the change in UDPSampler so that all data is sent again (currently only 1/3 is sent)
  • Fix "Paketprocessor throws exception if amount of triggers is not enough" #13 (the current workaround is replacing the break with pass to keep the PacketProcessor alive)
    • Currently the break (which stops the package processor) is still present
    • In addition to the error message, the stack trace is now printed to make maintenance easier
  • Run DBSCAN with different numbers of n_jobs (to compare runtimes)
    • The runtime improvement with n_jobs is far lower than the improvement achieved by running multiple instances of DBSCAN in parallel
    • Since the packages (trigger frames) are independent and can be processed independently, it is better not to use n_jobs
    • The open question is whether the performance can be improved by varying the number of triggers that are processed together
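As a footnote to the qsize point above, a minimal sketch of a macOS-safe queue-size check; safe_qsize, check_queue, and the exact warning text are hypothetical, only the threshold of 30 packages is taken from the notes above. On macOS, multiprocessing queues raise NotImplementedError from qsize() because the underlying sem_getvalue() is not implemented there:

from multiprocessing import Queue

QUEUE_WARNING_SIZE = 30  # warning size from the notes above

def safe_qsize(queue):
    # Queue.qsize() raises NotImplementedError on macOS
    try:
        return queue.qsize()
    except NotImplementedError:
        return None  # size unknown on this platform; skip the warning

def check_queue(queue):
    size = safe_qsize(queue)
    if size is not None and size > QUEUE_WARNING_SIZE:
        print(f"Warning: {size} packages in the queue "
              f"(warning size {QUEUE_WARNING_SIZE}); data cannot be "
              "processed at the rate of arrival.")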

@troehling troehling self-assigned this Jul 26, 2021
troehling pushed a commit that referenced this issue Jul 28, 2021
- Print stacktrace in case of an error for better maintainability
troehling pushed a commit that referenced this issue Sep 9, 2021
troehling pushed a commit that referenced this issue Sep 9, 2021