Initialise agglomerateByModule #744

RiboRings · 2025-06-27T08:40:35Z

Basic agglomerateByModule function leveraging cross-product was initialised for SummarizedExperiment class.

Consider whether we should:

Add separate method for TreeSE
Add support for sample modules in colData
Integrate more with existing agglomerate and merge functions

RiboRings · 2025-06-27T08:42:50Z

Example usage:

devtools::load_all()

data("GlobalPatterns")
tse <- GlobalPatterns

# Add some random modules
N_module <- 3L
modules <- sample(c(TRUE, FALSE),
                  size = nrow(tse) * N_module,
                  prob = c(0.2, 0.8),
                  replace = TRUE)

modules <- modules |> matrix(nrow = nrow(tse))
colnames(modules) <- paste0("module_", seq_len(ncol(modules)))
rowData(tse) <- cbind(rowData(tse), modules)

# Extract module columns
module_columns <- grep("module_", colnames(rowData(tse)), value = TRUE)

# Agglomerate based on modules
tse_module <- agglomerateByModule(tse, by = 1, group = module_columns)

antagomir · 2025-06-28T07:33:10Z

Seems good. Some notes:

**Add separate method for TreeSE: ** this is feasible for agglomerateByRanks where the grouping can follow phylogenetic tree. For modules the tree aggomeration might be ill-defined in many cases? Is it meaningful to do that for modules in the general case (or are there relevant special cases to support?). This can be added later if this requires more thinking.

**Add support for sample modules in colData: ** ideally this method could be symmetric and could be applied for rows and cols alike. We also have other such functions and that should be the default.

Integrate more with existing agglomerate and merge functions: Is there a fundamental difference compared to agglomerateByVariable? If not, we might like to combine these two. It might currently operate only with one column at the time but that is minor difference and just a special case of multi-column thing.

RiboRings · 2025-06-30T06:10:45Z

Add separate method for TreeSE I agree with you, so it seems preferable to downgrade TreeSE to SE and remove trees when agglomerating by module.

Add support for sample modules in colData: Now it works for both features and samples.

Integrate more with existing agglomerate and merge functions: I came to notice a subtle difference between the two. While agglomerateByVariable takes factor variables as groups, agglomerateByModule requires one or more binary (0/1, F/T) variables. Therefore, the two would be analogous only if factor variables were first one-hot encoded and used to agglomerate, but this does not sound like a convenient solution for factors with many levels. So in general, I would keep them as they are (partially integrated) and recommend agglomerateByVariable for single factor groups and agglomerateByModule for one or multiple logical groups.

RiboRings · 2025-07-01T10:29:33Z

This PR is ready for review

TuomasBorman

Looks good!

I guess this is more efficient than implementation in https://github.com/microbiome/mia/tree/agglomerateByModule (i.e., the branch can be removed)?

R/AllGenerics.R

TuomasBorman · 2025-07-01T15:35:14Z

DESCRIPTION

@@ -71,6 +71,7 @@ Imports:
    dplyr,
    IRanges,
    MASS,
    Matrix,


Can we avoid this dependency? You could use DelayedArray and crossprod()

So I looked a bit into this. Basically Matrix provides an optimised method for crossprod and sparse array for large adjacency matrices. This functionality is absent in DelayedArray, which focuses on delaying (instead of optimising) matrix operations.

In a nutshell, we have two options:

downgrade to base matrix and crossprod if we prioritise reducing package dependencies

use Matrix crossprod if we prioritise operations with optimised efficiency

It's a tough choice. Could depend on how much that performance gain would be in (2) in practice.

For a sparse Matrix of 1000 features by 1000 modules with 80% zeros, the time is about 0.1 s, or 1/10 compared to using base matrix and crossprod. The memory usage is 4 Mb instead of 12 Mb. Efficiency further improves for higher percentages of zeros, whereas it stays the same for base matrix.

Reference: https://alfurka.github.io/2022-08-12-faster-matrix-computation-with-r-using-sparsematrix/

Storing the modules in a separate place (e.g., metadata) might allow us to get rid of Matrix dependency, since the user would be able to convert modules to sparse array by themselves, and the agglomerateByModule function could be adjusted to support both Matrix and base matrix without importing it. But I would need to test if this can actually work.

R/agglomerate.R

TuomasBorman · 2025-07-01T15:49:34Z

R/merge.R

-.sum_counts_accross_features_na <- function(x, average, ids, ...){
-    # Which cell is not NA?
-    is_not_na <- !is.na(x)
-    type(is_not_na) <- "integer"
-    # Aggregate data to certain groups
-    x <- rowsum(x, ids, na.rm = TRUE)
-    # Calculate average if specified
-    if( average ){
-        x <- x/rowsum(is_not_na, ids)
+#' @importFrom scuttle sumCountsAcrossFeatures
+.agglomerate_assay <- function(
+    assay.type, assay, by, ids, na.rm, average, BPPARAM
+    ){
+    # Check assay
+    .check_assay_for_merge(assay.type, assay)
+    # Transpose if we are merging columns
+    if( by == 2L ){
+        assay <- t(assay)
    }
-    return(x)


Is there a reason why the functionality from this function is moved? It is better to modify only necessary lines to improve clarity and maintainability while minimizing the risk of introducing unintended changes

Sorry for the extra change. I was trying to make .merge_rows_or_columns more efficient, and give it a similar structure of .agglomerate_module_assay (which in my opinion is also clearer).

The code for .merge_rows_or_columns was using lapply for each of the 4 implemented stages of assay-wise operation (checking, transposing, merging and transposing back). For a clearer and more efficient syntax, I reduced the 4 lapply calls to a single streamlined (m)apply call, which in addition also checks and removes NAs for each assay (not only the first one).

R/agglomerate.R

antagomir · 2025-07-01T19:18:36Z

Still on this point:

Integrate more with existing agglomerate and merge functions:

agglomerateByVariable takes factor variables as groups
agglomerateByModule requires one or more binary (0/1, F/T) variables

Your argument was that factors with multiple levels should be one-hot-encoded to use the same.

But on the other hand, isn't it possible to use agglomerateByVariable for binary variables, and then it will essentially reduce to the same than agglomerateByModule? If so, agglomerateByModule would not have added value because same could be obtained with agglomerateByVariable applied on binary variables/factors/logicals/numerics.

But am I missing sometihing?

RiboRings · 2025-07-01T20:21:54Z

Several binary modules can be reduced to a single factor column as long as each feature belongs to only one module. Otherwise, the module matrix is necessary, in which case using agglomerateByVariable for each module would return one agglomerated SE with two rows (module = TRUE and module = FALSE) for each and every module, whereas agglomerateByModule would return one agglomerated SE with all modules as rows.

antagomir · 2025-07-01T20:39:32Z

Yes indeed, unless we consider the case where the variable in agglomerateByVariable contains NAs, which will not be agglomerated. Binary module encoded with NA + TRUE could result in a single row per module. This is more unclear way to implement something like that.

Another point is that agglomerateByVariable has been written to collapse rows (or cols) based on a single variable (e.g. factor with levels). It could be extended to cover case with multiple variables but this would more intuitively be a set of collapsed matrices with varying dimensionality (which would be determined based on the number of factor levels per each variable). Matrices with varying dimensionality could go into altExps. This could make sense in principle but it would indeed be a different method than agglomerateByModule.

Initialise agglomerateByModule

9b185e1

RiboRings requested a review from TuomasBorman June 27, 2025 08:40

RiboRings added 5 commits June 29, 2025 08:43

Revisit merge cols or rows function

fd8e184

Improve agglomerateByModule

0073374

Use sparse array in agglomerate by module

a50f5e7

Fix scuttle warning

77802e7

Add back line accidentally deleted

df512b0

Add examples and unit tests for agglomerateByModule

c5c5790

TuomasBorman requested changes Jul 1, 2025

View reviewed changes

RiboRings added 4 commits July 2, 2025 15:32

Implement feedback

e9ff8e0

Remove MATRIX dep for agglomerateByModule

789524b

Merge branch 'devel' into giulio

df2bfef

Update agglomerateByModule to work with metadata

6485437

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Initialise agglomerateByModule #744

Initialise agglomerateByModule #744

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Initialise agglomerateByModule #744

Are you sure you want to change the base?

Initialise agglomerateByModule #744

Uh oh!

Conversation

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!