Improve pymepix pipeline performance (centroiding) · Issue #14 · CFEL-CMI/pymepix · GitHub
This repository was archived by the owner on Mar 25, 2024. It is now read-only.

Improve pymepix pipeline performance (centroiding) #14

Closed
BenMoon opened this issue Jul 25, 2021 · 1 comment
Assignees
Labels
enhancement New feature or request

Comments

@BenMoon
Contributor
BenMoon commented Jul 25, 2021

As expected, the centroiding in pymepix becomes a bottleneck when run with too much data. We measured the time the clustering and the centroiding take separately. Clustering takes far longer (roughly a factor of 100: 0.2–0.75 s for clustering versus 0.002–0.006 s for centroiding, for approximately 15,000 voxels per trigger, with 5898 triggers in the complete dataset). The pipeline is not fast enough to empty the queue of data coming from the packetprocessor, so data piles up. With a single centroiding process and the dataset described above at 100 Hz, a queue of about 72 packets builds up (approximately one packet per second); processing would have to be twice as fast in this case to handle all packets in time.

For now, the number of data chunks being sent to the packetprocessor is reduced to a third. The number of ions per shot is around 1, and we run at 200 Hz.

I can think of a couple of options to improve the performance of the centroiding:

  1. It should be possible to start multiple instances of the centroiding
  2. Providing n_jobs to dbscan doesn't work at the moment; we submitted an issue with joblib (Loky-backed parallel loops cannot be called in a multiprocessing, setting n_jobs=1 warning with scikit-learn joblib/joblib#1204)
  3. Parallelise dbscan in another way, without n_jobs
  4. Replace scikit-learn.dbscan with a compiled version
  5. Combine 3 and 4
  6. Add the centroiding to the clustering in a compiled version, like 5

Option 1 should be very straightforward and is worth testing to see whether it solves the immediate bottleneck for now. In pymepix/pymepix/processing/acquisition.py, line 141, change

self.addStage(4, Centroiding)

to

self.addStage(4, Centroiding, num_processes=10)

As for option 2, let's see what they come up with.

For option 3 I see a lot of potential, if done right. In principle, the structure of the data allows a trivial parallelisation of the cluster finding. The danger here is that if dbscan is performed over each trigger number like so:

import numpy as np
from sklearn.cluster import DBSCAN

unique_trig_nr = np.unique(trigger_numbers)
for trig_nr in unique_trig_nr:
    idx = np.where(trigger_numbers == trig_nr)
    # stack the slices into a (n_samples, 3) array, as DBSCAN.fit expects
    data = np.column_stack((x[idx], y[idx], tof[idx]))
    DBSCAN(...).fit(data)  # .fig() was a typo for .fit()
...

the time dbscan itself takes, compared with the time spent slicing the data, calling the function, serialising the data, and so on, is much too short, and we would produce a tremendous overhead. The cluster computation has to take much longer than all the overhead the parallelisation introduces.
The way we could achieve this is to leave the clustering as it is, but add to centroiding.py a worker pool which performs the clustering on a per-chunk basis and maybe even does the centroiding asynchronously. Naively this should scale linearly, so with a pool of 10 workers we can expect an improvement of a factor of 10.
Need to look into joblib and loky some more to understand the possibilities of creating such a worker pool...
→ After a bit of reading, multiprocessing.Pool is probably the best bet here; a minimal sketch follows below.
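To make the idea concrete, here is a minimal sketch of such a chunk-based pool. The helper names (cluster_chunk, cluster_all), the chunk layout (a tuple of trigger_numbers, x, y, tof arrays), and the DBSCAN parameters are made up for illustration and are not the current pymepix API:

import numpy as np
from multiprocessing import Pool
from sklearn.cluster import DBSCAN

NUM_WORKERS = 10  # assumed pool size, matching the naive 10x estimate above

def cluster_chunk(chunk):
    # One task = one chunk containing many triggers, so the slicing and
    # serialisation overhead is amortised over many dbscan runs.
    trigger_numbers, x, y, tof = chunk
    labels = np.full(trigger_numbers.shape, -1, dtype=np.int64)
    for trig_nr in np.unique(trigger_numbers):
        idx = np.where(trigger_numbers == trig_nr)[0]
        data = np.column_stack((x[idx], y[idx], tof[idx]))
        # eps/min_samples are placeholder values, not tuned settings;
        # labels are per-trigger cluster ids (-1 marks noise)
        labels[idx] = DBSCAN(eps=2.0, min_samples=3).fit(data).labels_
    return labels

def cluster_all(chunks):
    # One pool for the whole run; results come back in submission order
    with Pool(NUM_WORKERS) as pool:
        return pool.map(cluster_chunk, chunks)

The crucial design point is that the unit of work handed to the pool is a whole chunk of triggers, not a single trigger, so the per-task overhead stays small relative to the actual clustering time.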

scikit-learn.dbscan looks like a pure Python implementation (https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/cluster/_dbscan.py). However, apparently it gets compiled with Cython, though just by looking at dbscan I don't quite understand how (a quick way to check is shown below). If it doesn't get compiled, we can expect a significant acceleration from a compiled replacement (again, provided the overhead of context switching and potential memory copies isn't too big...); if it does get compiled, we won't. E.g. https://github.com/s1998/parallel-DBSCAN claims a speed-up of 13x on a 36-core machine compared to scikit-learn.dbscan (frankly, if a C++ implementation running on 36 cores only gets 13 times faster than scikit-learn.dbscan, dbscan is already pretty efficient). Btw, numba probably doesn't help on that end either, although testing it isn't much effort, and then we would know for sure...
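For what it's worth, a quick way to check where the compiled part lives, assuming a recent scikit-learn in which _dbscan.py delegates its inner loop to the Cython extension sklearn.cluster._dbscan_inner (if the layout differs in the installed version, the import will simply fail):

from sklearn.cluster import _dbscan_inner

# A .so (Linux/macOS) or .pyd (Windows) suffix in this path means the
# inner region-expansion loop is a compiled Cython extension, not pure Python.
print(_dbscan_inner.__file__)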

The lowest-hanging fruit is option 1, and the biggest difference should come from option 3. If still necessary, switching to a compiled version of dbscan should be a transparent change, but since we would already be running the clustering in a pool of workers by then, not much additional performance gain is expected (maybe a factor of 5 or 10 at best).

@BenMoon BenMoon added the enhancement New feature or request label Jul 25, 2021
@troehling
Contributor
troehling commented Jul 26, 2021
  • fill ??? with values (under which conditions, statistics from the dataset)
    • tested dataset:
      • approximately 15,000 voxels per trigger, 5898 triggers in the dataset overall
      • packages sent at a rate of 100 Hz
    • With these data rates, the queue size grows to about 72 packages after 60 s of data is received
    • roughly one package per second cannot be processed in time and accumulates in the queue
  • show an indicator in the GUI if the queue gets too large
    • qsize does not work under macOS; use a try-catch or query the OS (a sketch follows after this list)
    • The current size of the input queue of the centroiding is now displayed in the processing tab
    • If the size exceeds the current limit of 30 packages in the queue, a warning is displayed to the user
    • The warning states the following: "The queue size has passed the warning size of 30 packages with 33 packages in the queue. The data can not be processed at the rate of arrival. Please consider increasing the number of processes allocated to the centroiding, reducing the data frequency or increase the packages skipped for live processing. The current queue size can be checked in the processing tab. This can have a significant impact on the recorded data! Possible loss of data!"
  • increase number of processes statically
    • The number of processes for centroiding is increased to 25 for now
    • this should be enough at the moment
    • For higher data rates it might need to be increased even further
    • @sebastian-trippel suggested using as many cores as possible for this. @BenMoon assumes this might incur a higher overhead due to blocking when reading from the queue.
  • increase number of processes dynamically
    • If a queue builds up, increase the number of processes dynamically (scale up)
    • the open question is how to scale down in an intelligent way
  • Revert the change in UDPSampler so that all data is sent again (currently only 1/3 is sent)
  • Fix "Paketprocessor throws exception if amount of triggers is not enough" #13 (the current workaround is replacing the break with pass to keep the PacketProcessor alive)
    • Currently the break (which stops the package processor) is still present
    • In addition to the error message, the stack trace is now printed to make maintenance easier
  • Run DBSCAN with different numbers of n_jobs (to compare runtimes)
    • The runtime improvement with n_jobs is far lower than the improvement achieved by running multiple instances of DBSCAN in parallel
    • Since the packages (trigger frames) are independent and can be processed independently, it is better not to use n_jobs
    • The open question is whether the performance can be improved by varying the number of triggers that are processed together
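As a footnote to the qsize point above, a minimal sketch of a macOS-safe queue-size check; safe_qsize, check_queue, and the exact warning text are hypothetical, only the threshold of 30 packages is taken from the notes above. On macOS, multiprocessing queues raise NotImplementedError from qsize() because the underlying sem_getvalue() is not implemented there:

from multiprocessing import Queue

QUEUE_WARNING_SIZE = 30  # warning size from the notes above

def safe_qsize(queue):
    # Queue.qsize() raises NotImplementedError on macOS
    try:
        return queue.qsize()
    except NotImplementedError:
        return None  # size unknown on this platform; skip the warning

def check_queue(queue):
    size = safe_qsize(queue)
    if size is not None and size > QUEUE_WARNING_SIZE:
        print(f"Warning: {size} packages in the queue "
              f"(warning size {QUEUE_WARNING_SIZE}); data cannot be "
              "processed at the rate of arrival.")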

@troehling troehling self-assigned this Jul 26, 2021
troehling pushed a commit that referenced this issue Jul 28, 2021
- Print stacktrace in case of an error for better maintainability
troehling pushed a commit that referenced this issue Sep 9, 2021
troehling pushed a commit that referenced this issue Sep 9, 2021